Orthogonal Representations of Object Shape and Category in Deep Convolutional Neural Networks and Human Visual Cortex

Deep Convolutional Neural Networks (CNNs) are gaining traction as a benchmark model of visual object recognition, with performance on some benchmarks now surpassing that of humans. While CNNs can accurately assign one image to potentially thousands of categories, network performance could be the result of layers that are tuned to represent the visual shape of objects rather than object category, since the two are often confounded in natural images. Using two stimulus sets that explicitly dissociate shape from category, we correlate these two types of information with each layer of multiple CNNs. We also compare CNN output with fMRI activation along the human visual ventral stream by correlating artificial with neural representations. We find that CNNs encode category information independently from shape, peaking at the final fully connected layer in all tested CNN architectures. Comparing CNNs with fMRI brain data, early visual cortex (V1) and early layers of CNNs encode shape information, while anterior ventral temporal cortex encodes category information, which correlates best with the final layer of CNNs. The interaction between shape and category found along the human visual ventral pathway is echoed in multiple deep networks. Our results suggest that CNNs represent category information independently from shape, much like the human visual system.


Methods
We aimed to determine the relationship between models of shape and category, CNNs, and neural responses in the human visual ventral pathway. We tested object shape and category representation in four top-performing CNNs and compared this with behavioural ratings of shape and category, as well as with human fMRI response patterns from experiments in two previous studies 16,30. The fMRI responses reported here use different ROIs than the original studies. Some findings related to the second stimulus set have been published previously: for that set, human behavioural ratings were compared with three of the four CNNs presented here in Kubilius et al. 18, and human fMRI responses to shape and category were reported in Bracci & Op de Beeck 16.
Below we describe participants, stimulus sets, CNN architectures, the neuroimaging experiments, and data analysis.
Participants. All participants gave written informed consent. All experiments were approved by the Ethics Committee at KU Leuven and the University Hospitals Leuven, and all methods were performed in accordance with the relevant guidelines and regulations. For the behavioural ratings, each stimulus set was rated by an independent group of participants (N = 4 for Set A; N = 16 for Set B). For the neuroimaging experiments, 15 participants (8 females, mean age 30 years) were scanned in fMRI experiment A, none of whom were excluded. A further 15 participants (8 females, mean age 24 years) were scanned in fMRI experiment B, one of whom was excluded due to excessive head motion. All participants had normal or corrected-to-normal vision.
Stimulus sets. The stimuli in both experiments were designed to dissociate shape from category information. Both stimulus sets are grayscale images of objects on a white or grey background, centred at the origin and presented at a normal viewing angle (see Fig. 1). Set A contains 32 unique images, divided into 2 equally sized categories (animal vs non-animal) and 2 equally sized groups of shapes (low and high aspect ratio). Set B contains 54 images divided into 6 object categories (minerals, animals, fruit/veg, music, sport and tools) and 9 shape types. The model design for each stimulus set, which orthogonalises shape from category, is illustrated in Fig. 1. For additional information about the stimulus sets, refer to Ritchie and Op de Beeck 30 and Bracci and Op de Beeck 16 , for Set A and B respectively.
To confirm that shape was not predictive of category information in either stimulus set, we analysed the images using low-level GIST descriptors 31 and tested how well these visual features predicted shape or category using Linear Discriminant Analysis (LDA). GIST provides a low-dimensional representation of an image based on spectral and coarsely localised information. We defined the GIST descriptors to include 8 orientations over 8 scales and combined this with LDA. For Set A, we ran a two-way classification using a leave-one-level-out procedure, for example, training on bar stimuli and generalising to blob stimuli to test for animacy classification. For Set B, we followed a six-way classification scheme with a leave-one-level-out test procedure, permuting across all possible combinations of train and test groups and averaging across results. For example, we selected six of the nine shape clusters, trained an LDA on the GIST descriptors from five of them (5 clusters × 6 images = 30 images) and tested whether the algorithm could predict the 6 different categories for the held-out images. All six-way shape and category combinations were tested (504 in total) and averaged across cross-validation folds to provide a summary result.
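The leave-one-level-out procedure for Set A might be sketched as follows. This is a minimal illustration, not the authors' code: random features and an illustrative feature dimension stand in for real GIST descriptors, which would come from a GIST implementation.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Stand-ins for Set A: 32 images x 64 features (real GIST descriptors with
# 8 orientations over 8 scales would be higher-dimensional).
gist = rng.normal(size=(32, 64))
category = np.repeat([0, 1], 16)                 # animal vs non-animal
shape_level = np.tile(np.repeat([0, 1], 8), 2)   # low vs high aspect ratio

# Leave-one-shape-level-out: train the category classifier on one shape level
# (e.g. "bar" stimuli), test generalisation to the held-out level ("blobs"),
# and average over both directions.
accs = []
for held_out in (0, 1):
    train, test = shape_level != held_out, shape_level == held_out
    lda = LinearDiscriminantAnalysis().fit(gist[train], category[train])
    accs.append(lda.score(gist[test], category[test]))
acc = float(np.mean(accs))
```

With random features, accuracy hovers around the 50% chance level; the logic of holding out an entire shape level is what forces the classifier to rely on category rather than shape cues.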
Behavioural ratings. Each stimulus set was rated on object category and shape properties by means of the multiple object arrangement method 32. Participants rated similarity in two task contexts: for object category, "arrange the images based on the semantic similarity among objects"; for object shape, "arrange the images based on perceived object shape similarity". These models, based on behavioural data, represent the stimulus psychological space and are shown in Fig. 1.

fMRI preprocessing and analysis. All imaging data were pre-processed and analysed using SPM and MATLAB.
For each participant, fMRI data were slice-time corrected, motion corrected (using spatial realignment to the first image), coregistered to the individual's anatomical scan, segmented and spatially normalised to the standard MNI template. Functional images were resampled to 3 × 3 × 3 mm voxel size and spatially smoothed by convolving with a Gaussian kernel of 6 mm FWHM for Set A and 4 mm FWHM for Set B 33. After pre-processing, a GLM was used to model the BOLD signal for each participant, for each stimulus, at each voxel. Regressors for the GLM included each stimulus condition of interest (32 for A, 54 for B) and 6 motion correction parameters (x, y and z coordinates for translation and rotation). Each predictor's time course was modelled as a boxcar function convolved with the canonical haemodynamic response function, producing a single estimate for each voxel per predictor for every run. The beta weights fitted to each GLM were used to create Representational Dissimilarity Matrices (RDMs) for each participant (defined below).

Regions of interest. We defined three ROIs in visual cortex: primary visual cortex (V1), and ventral temporal cortex (VTC), which was split into posterior (VTC post) and anterior (VTC ant) halves. These ROIs were chosen for their relevance to both object shape and category information processing 24. VTC is bounded laterally by the occipitotemporal sulcus (OTS), posteriorly by the posterior transverse collateral sulcus (ptCoS) and anteriorly by the anterior tip of the mid-fusiform sulcus (MFS) 24. ROIs were defined at the group level by combining the anatomical criteria above (using the Neuromorphometrics atlas in SPM) with functional criteria (all voxels responding to the contrast of all conditions versus baseline above the statistically uncorrected threshold of p < 0.001 in a second-level analysis). For further details on ROI definition, please refer to Bracci, Kalfas & Op de Beeck 34, where the exact same ROI criteria were applied.
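A GLM predictor of the kind described above (a boxcar convolved with the canonical HRF) can be sketched as follows. The TR, event timings, and the simple double-gamma HRF parameters are assumptions for illustration; SPM's canonical HRF has roughly this form.

```python
import numpy as np
from scipy.stats import gamma

TR = 2.0  # repetition time in seconds (assumed; not stated in the text)

def canonical_hrf(tr, duration=32.0):
    # SPM-style double-gamma shape: an early peak minus a scaled undershoot,
    # sampled at the scan TR and normalised to unit sum.
    t = np.arange(0.0, duration, tr)
    hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0
    return hrf / hrf.sum()

# Boxcar for one condition: stimulus on during scans 5-7 and 20-22 (illustrative).
n_scans = 40
boxcar = np.zeros(n_scans)
boxcar[[5, 6, 7, 20, 21, 22]] = 1.0

# One GLM regressor: the boxcar convolved with the canonical HRF, truncated to
# the scan length; the GLM fits one beta per such regressor per voxel.
regressor = np.convolve(boxcar, canonical_hrf(TR))[:n_scans]
```

Stacking one such regressor per condition (plus the six motion parameters) gives the design matrix against which the beta weights are estimated.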
We used a two-factor repeated-measures analysis of variance (ANOVA) to assess the interaction between two within-participant factors: Condition (shape, category) and ROI (V1, VTC post and VTC ant).
Deep neural network architectures. Each architecture consists of multiple convolutional layers followed by pooling operations and fully connected layers. Each CNN was pre-trained on the ImageNet dataset 35; we ran a forward pass of each image in the stimulus set through the network and extracted the activations of every layer, yielding a matrix of size number of units per layer by number of stimuli (32 for A, 54 for B). We computed 1 minus the correlation between the activation patterns of every pair of stimuli to obtain an RDM of size N × N, where N is the number of stimulus conditions (32 × 32 for A, 54 × 54 for B). We did not include the final softmax classification layers in our analysis, since we were interested in the structure of layer representations rather than classification performance per se.
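Constructing a layer RDM from activations might look like the sketch below. Random activations stand in for a real forward pass, which would use framework-specific hooks on a pretrained network; only the RDM computation itself is shown.

```python
import numpy as np

def layer_rdm(activations):
    # activations: (n_stimuli, n_units) array of one layer's response patterns.
    # np.corrcoef correlates rows (stimuli) across units, so 1 - corrcoef
    # gives the pairwise correlation-distance RDM.
    return 1.0 - np.corrcoef(activations)

# Stand-in activations for Set A (32 stimuli); in practice these come from a
# forward pass of each image through the pretrained CNN.
rng = np.random.default_rng(1)
acts = rng.normal(size=(32, 4096))
rdm = layer_rdm(acts)
```

The result is symmetric with a zero diagonal, as required for the RSA comparisons described later.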

CaffeNet.
CaffeNet is an implementation of AlexNet 1 in the Caffe deep learning framework 36. It is an 8-layer convolutional neural network (CNN) with five convolutional layers and three fully connected layers.

GoogLeNet.
GoogLeNet 2, also known as InceptionNet, was the top-performing architecture for image classification in ILSVRC 2014 35. GoogLeNet is a 22-layer deep network when counting only parameterised layers, or 27 layers deep if including pooling operations. All convolution, reduction and projection layers use rectified linear activation. The bottom layers of the network follow conventional CNN architecture, consisting of chained convolutional operations followed by max pooling. The top layers of the network replace multiple fully connected layers with an average pooling layer, a single fully connected layer and a classification layer. The middle layers of the network differ substantially from traditional CNN structure, consisting of stacked "inception" modules, which are miniature networks containing one max pooling and three multi-sized convolution operations (1 × 1, 3 × 3 and 5 × 5 convolutions) in parallel configuration. Convolution operations inside inception modules are optimised with dimensionality reduction, by preceding expensive 3 × 3 and 5 × 5 convolution operations with 1 × 1 convolutions. Inception modules allow for increased width of the network, as well as depth, while maintaining a constant computational budget.

ResNet50.
ResNets are a family of extremely deep architectures that won the ILSVRC classification task in 2015 37. ResNet50 is a 50-layer network built from stacked "residual units", which use a split-transform-merge strategy to perform identity mappings in parallel with 3 × 3 convolutions and rectification. ResNets, like GoogLeNet 2, are multi-branch architectures, but contain only 2 branches (performing identity projection and 3 × 3 convolutions) instead of GoogLeNet's inception modules with up to 4 branches (performing multi-size convolutions). Identity mappings play a key role in the architecture's success, encouraging the network to preserve features rather than learn entirely new representations at every layer, as is the case with conventional CNNs 38. The final 3 layers of ResNet50 are identical in design to GoogLeNet's: average pooling, a fully connected transformation to 1000 dimensions, and softmax classification (not included in our analysis).
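The identity-shortcut idea can be illustrated with a toy fully connected residual unit. This is a sketch of the principle only, not ResNet50's actual convolutional bottleneck blocks.

```python
import numpy as np

def residual_unit(x, w1, w2):
    # Transform branch: two linear maps with a ReLU in between (a toy stand-in
    # for the convolutions of a real residual unit).
    fx = w2 @ np.maximum(w1 @ x, 0.0)
    # Merge: add the identity shortcut back, then rectify. With zero weights the
    # unit simply passes (rectified) x through, i.e. features are preserved
    # unless the transform branch learns a useful residual.
    return np.maximum(fx + x, 0.0)

# With an all-zero transform branch, a non-negative input passes through intact.
x = np.ones(4)
out = residual_unit(x, np.zeros((4, 4)), np.zeros((4, 4)))
```

This is the sense in which identity mappings force feature preservation: the block only modifies the representation to the extent that its learned residual F(x) is non-zero.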

Representational similarity analysis.
We used Representational Similarity Analysis (RSA) to quantitatively compare CNN representations per layer with design models, behavioural ratings, and with fMRI neuroimaging data. RSA compares RDMs, which characterise the representational information in a brain or model 39 . Given a set of activity patterns (biological, behavioural or artificial) for a set of experimental conditions, the dissimilarity between patterns is computed as 1 minus the correlation across the units that compose the patterns. RDMs are symmetrical about a zero diagonal, where 0 denotes perfect correlation. RSA assesses second-order isomorphism, which is the shared similarity in structure between dissimilarity matrices 40 . Spearman rank order correlation was used to compare dissimilarity matrices, since the relationship between RDMs cannot be assumed to be linear 39 . In cases where there was any dependency relationship between shape and category RDMs (visible in the Set A behavioural data), we used partial correlation.
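A minimal RSA sketch follows, including a partial-correlation variant. The partial Spearman correlation is implemented here via one common formulation (residualising the ranks against the partialled-out RDM); the authors' exact implementation is not specified in the text.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

def lower_triangle(rdm):
    # RDMs are symmetric with a zero diagonal, so only the lower triangle
    # (excluding the diagonal) carries unique information.
    i, j = np.tril_indices(rdm.shape[0], k=-1)
    return rdm[i, j]

def rsa(rdm_a, rdm_b):
    # Second-order isomorphism: Spearman rank correlation between two RDMs.
    return spearmanr(lower_triangle(rdm_a), lower_triangle(rdm_b)).correlation

def partial_rsa(rdm_a, rdm_b, rdm_z):
    # Spearman partial correlation of A and B with Z partialled out:
    # rank-transform, regress out Z's ranks, correlate the residuals.
    a, b, z = (rankdata(lower_triangle(r)) for r in (rdm_a, rdm_b, rdm_z))
    Z = np.column_stack([np.ones_like(z), z])
    def resid(v):
        coef, *_ = np.linalg.lstsq(Z, v, rcond=None)
        return v - Z @ coef
    ra, rb = resid(a), resid(b)
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

# Illustrative synthetic RDMs standing in for brain, model, and control RDMs.
rng = np.random.default_rng(2)
def _rand_rdm(n):
    d = rng.random((n, n))
    d = d + d.T
    np.fill_diagonal(d, 0)
    return d

brain, model, control = _rand_rdm(12), _rand_rdm(12), _rand_rdm(12)
r = rsa(brain, model)
pr = partial_rsa(brain, model, control)
```

The partial variant is what the paper applies wherever the shape and category models are themselves correlated (as in the Set A behavioural models).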

Results
Behavioural data. For each stimulus set, participants provided similarity judgements for the shape and category dimensions (see Fig. 1, right column). For Set A, we found a significant correlation between the behavioural models for shape and category (Spearman's ρ = 0.4753, p < 0.001, permutation test with 1000 randomisations of stimulus labels), and so partial correlations were performed when carrying out RSA with Set A models (this includes comparisons with CNN layers and with brain data). For Set B, we found no significant correlation between the behavioural models for shape and category (ρ = 0.006, p = 0.8209). Behavioural ratings for Set B were previously used in Bracci & Op de Beeck 16.
Low-level shape analysis of stimuli. Using GIST 31 descriptors of each image combined with LDA, we confirmed that category could not be predicted from these low-level descriptors whereas shape could, demonstrating that our stimulus sets were properly orthogonalised. LDA with GIST predicted shape above chance level, at 87.5% for Set A and 69% for Set B. Category was predicted below chance level, at 37.5% for Set A (chance: 50%) and 10% for Set B (chance: approximately 16.7%).
Shape and category RSA on all CNN layers for stimulus sets A and B. Figure 2 illustrates layer-by-layer RSA between the CNN representations and the shape and category models in the two stimulus sets. Note that all applications of RSA using the Set A behavioural models involved partial correlations (see Results: Behavioural data). Across all networks, shape is already represented above the significance threshold in the first layer in most cases, whereas category is not. We determined the significance threshold of every correlation by comparing it with a null distribution obtained by randomly permuting the RDM labels and recalculating dissimilarity relationships 1000 times. Shape correlations at the first layer of CNNs are lower and closer to the significance threshold for Set A (0.12 < ρ < 0.24) than for Set B (0.24 < ρ < 0.36). For Set A, shape correlations remain relatively high at the final layer (0.29 < ρ < 0.59). In contrast, for Set B, shape correlation levels increase through the networks before falling in the final layers of all networks to roughly their initial values (0.32 < ρ < 0.36). For all networks, category information remains low across the majority of layers, hovering at or below the significance level until the final few layers, where it increases above the significance threshold to peak at the final layer. At the final layer, category correlations reach 0.34 < ρ < 0.42 for Set A and 0.24 < ρ < 0.37 for Set B. The values of final layer correlations are presented in Table 1. Set B correlations with CaffeNet, VGG-19 and GoogLeNet are reproduced from Kubilius et al. 18.
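The label-permutation procedure behind the significance threshold can be sketched as follows, with illustrative synthetic RDMs (the paper uses 1000 permutations; fewer are used here for brevity).

```python
import numpy as np
from scipy.stats import spearmanr

def permutation_threshold(rdm_a, rdm_b, n_perm=1000, alpha=0.05, seed=0):
    # Null distribution for an RDM correlation: permute the condition labels of
    # one RDM (rows and columns together, preserving its internal structure)
    # and recompute the Spearman correlation each time.
    rng = np.random.default_rng(seed)
    n = rdm_a.shape[0]
    i, j = np.tril_indices(n, k=-1)
    null = np.empty(n_perm)
    for p in range(n_perm):
        perm = rng.permutation(n)
        shuffled = rdm_b[np.ix_(perm, perm)]
        null[p] = spearmanr(rdm_a[i, j], shuffled[i, j]).correlation
    return float(np.quantile(null, 1.0 - alpha))

# Illustrative synthetic RDMs.
rng = np.random.default_rng(3)
d1, d2 = rng.random((12, 12)), rng.random((12, 12))
rdm1, rdm2 = d1 + d1.T, d2 + d2.T
np.fill_diagonal(rdm1, 0)
np.fill_diagonal(rdm2, 0)
threshold = permutation_threshold(rdm1, rdm2, n_perm=200)
```

An observed correlation is then called significant when it exceeds this null quantile.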
To investigate the interaction between shape/category information and CNN layers, we tested correlation values in a 2 × 2 ANOVA with Layer (modelled linearly with intercept and slope) and Condition (Shape or Category) as factors. Table 2 summarises the statistical results of the main effects (Layer, Condition) and their interaction in CNNs and models. For Set A, across all networks, Layer has a highly significant main effect and Condition is also significant (Table 2), which suggests that correlation values can be predicted from the CNN layer and the condition of interest (shape or category information). Their interaction is significant in GoogLeNet and VGG-19, but not in CaffeNet and ResNet50, suggesting that as category information increases, shape information decreases significantly in two of the four networks tested. For Set B, Condition is highly significant across all networks, and Layer has a significant main effect in the behavioural model correlations, suggesting that behavioural shape and category judgements can be significantly predicted from CNN layer information; the interaction between Layer and Condition is significant only in CaffeNet.
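The Layer × Condition model with a linear layer term can be illustrated by ordinary least squares on toy correlation profiles (illustrative values, not the paper's data).

```python
import numpy as np

# Toy layer-by-layer correlation profiles for an 8-layer network: shape falls
# and category rises across layers (numbers are made up for illustration).
layers = np.arange(8, dtype=float)
shape_r = 0.40 - 0.02 * layers
cat_r = 0.05 + 0.04 * layers

# Regression with Layer (linear), Condition (0 = shape, 1 = category) and their
# product. The interaction coefficient beta[3] captures how the layer slope
# differs between the two conditions, the effect tested in the ANOVA.
y = np.concatenate([shape_r, cat_r])
layer = np.tile(layers, 2)
cond = np.repeat([0.0, 1.0], 8)
X = np.column_stack([np.ones(16), layer, cond, layer * cond])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Here beta[1] recovers the shape slope (-0.02) and beta[3] the slope difference (0.04 - (-0.02) = 0.06); a significant interaction of this kind is what signals shape falling as category rises.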
In summary, across both Sets A and B, shape information gradually increases and/or wavers as the network is traversed, before falling in the final layers. Category information sits at or below the significance threshold in the initial layer before reaching its maximum at the final layer, the opposite trend to shape. Figure 3 contains multidimensional scaling plots of peak shape and category information for Sets A and B.

Shape versus category information in visual ventral stream regions. Figure 4 summarises the representational similarity in three regions of interest (ROIs) along the visual ventral pathway, from low-level area V1 through to posterior and anterior VTC, compared with models of shape and category. Note that all RSA using Set A involved partial correlations (see Results: Behavioural data). For Set A, shape information decreases along the ventral stream, from 18% to 10%, while category information increases, from −6% to 40%. We tested RSA results using a two-factor ANOVA, with ROI (V1, VTC post, VTC ant) and Condition (shape, category) as within-subject factors. For Set A, results reveal a significant main effect of ROI (F 2,15 = 35.81, p < 0.001), whereas the main effect of Condition (shape vs category) is not significant (F 1,15 = 1.02). There is a significant interaction between ROI and Condition (F 2,15 = 73.34, p < 0.001), indicating that as category information increases from V1 to VTC ant, shape information decreases. Post hoc pairwise t-tests further confirmed the dissociation between shape and category along the visual ventral stream: category divisions explained the neural pattern in later ventral areas (VTC ant) significantly better than shape (t(15) = 5.67, p < 0.0001), whereas the opposite was true in early visual area V1, where shape was significantly more related to the neural data than category divisions (t(15) = 8.16, p < 0.0001).

For Set B, we see a qualitatively similar trend of decreasing shape information from V1 to VTC anterior (from 18% to 4%) and increasing category information (from 1% to 6%). The two-factor ANOVA, with ROI (V1, VTC post, VTC ant) and Condition (Category, Shape), revealed that when correlating ROI representations with the models for Set B, the effect of ROI is significant (F 2,14 = 3.79, p = 0.027), as is Condition (F 1,14 = 33.84, p < 0.001), and there is a highly significant interaction between ROI and Condition (F 2,14 = 13.33, p < 0.001). Again, pairwise t-tests confirmed the dissociation between shape and category in visual brain regions, with shape being significantly more related to neural data in early visual area V1 than category (t(14) = 5.28, p = 0.0001). Category did not explain neural patterns in VTC ant significantly better than shape (t(14) = 1.20, p = 0.24). Overall, these results illustrate a decrease in shape information combined with an increase in category information going from V1 to VTC anterior.

RSA for fMRI brain data and all CNN layers. Neural fMRI responses for each participant and ROI, for Set A and Set B, were correlated with the RDMs of every layer of each CNN. Results are shown in Fig. 5. For each stimulus set and network, correlation values were tested in a 2 × 3 ANOVA with Layer (modelled linearly with intercept and slope) and ROI as within-subject factors. In CaffeNet, V1 and VTC posterior correlations peaked at the third convolutional layer, and VTC anterior peaked at the final layer for both stimulus sets.
For both stimulus sets, the 2 × 3 ANOVA results reveal a significant main effect of ROI. We also ran a one-sample t-test and found that, across all networks, the correlations between the final four layers and the activation of VTC anterior across subjects were significantly different from zero for both datasets (Set A: p < 0.0001, Set B: p ≤ 0.0006).

Explained variance from shape versus category information in correlations between fMRI brain data and CNN layers. We partialled out shape and category when correlating between brain data and CNN layers to investigate the level of variance that could be explained by each of these factors individually. Figure 6 shows partial correlations for all CNNs and V1 (first two columns) and VTC anterior (last two columns), partialling out shape or category. Zero is indicated by the dotted line.
Looking at the first row of Fig. 6, we see that partial correlations between V1 and both datasets follow a similar pattern. When partialling out shape or category, correlations remain quite high along the convolutional layers, indicating that there is additional information shared between V1 and these CNN layers that is not accounted for by shape or category alone. Across all layers, the correlation remaining after partialling out shape (blue) is lower than that remaining after partialling out category (magenta). The main decrease in these partial correlations occurs in the final fully connected layers, where there is near-zero category-related information in common between V1 and the final layers of CNNs. Some shape-related correlation (indicated by the magenta line) is still preserved between V1 and the final layers of CNNs.

Discussion
In this study, we investigated orthogonal shape and category representations in biological and artificial networks by making comparisons between: (i) CNNs and behavioural models of shape and category; (ii) behavioural models and the brain; and (iii) CNNs and the brain. First, comparing artificial networks and models, we found that CNNs represent category information in addition to shape, and that category information peaks at the final layer for all tested CNNs, regardless of network depth. Second, comparing models and the brain, there is a two-way interaction between shape and category in the human visual ventral pathway, where shape is best represented earlier in V1, and category emerges later in anterior VTC. This interaction between shape and category is significant across both stimulus sets. Third, comparing artificial networks and the brain, V1 correlates highest with early to mid-level layers of deep networks, and anterior VTC correlates best with the final layer of CNNs. Across both stimulus sets and for all networks, peak correlations with V1 always occur in earlier network layers than peak correlations with anterior VTC, demonstrating that CNNs reflect a similar order of computational stages as the human ventral pathway when processing these object images.
Our results allow for a greater understanding of how shape and category are represented in deep networks and in the visual ventral pathway, in particular: (i) how differing shape and category definitions between the two stimulus sets reveal differences between low-level and high-level shape representations in CNNs and the brain; (ii) how shape and category processing along deep network layers maps onto brain regions; and (iii) how careful stimulus design allows us to make better inferences about category information in the brain and in CNNs.
One major advantage of this study is that we consider two stimulus sets that carefully control shape and category to draw conclusions about their interaction, rather than broadly extrapolating results from a single set of images. These two well-controlled stimulus sets are similar in design but differ slightly in how shape and category are defined, allowing a finer interpretation of results. Looking at the differences in shape definitions between the stimulus sets: in Set A, shape is defined by low versus high aspect ratio (described as "bar-like" or "blob-like"), while in Set B it is characterised retinotopically. Comparing CNNs and models, both low-level (Set B) and high-level (Set A) shape information is preserved until the very last layer of all networks; however, there is a visible reduction in low-level compared with high-level shape information in the final layers. Comparing models and the brain, we see that high-level (Set A) shape information remains quite high in VTC ant, whereas low-level (Set B) shape information drops to correlation levels at or near zero. The plausible explanation for why shape information drops off in Set B but not in Set A is that higher-level regions represent a more abstract form of shape, which is factored into the design of Set A but not B. Indeed, previous studies showed that perceived shape similarity strongly overlaps with higher-level brain representations in humans 41 and in monkeys 12,42. Kalfas et al. 12 found that the deepest layers of networks, rather than IT responses, correlated best with human shape similarity judgements.

Figure 6. Partial correlations between CNNs and V1 (first two columns) or VTC ant (last two columns). Magenta shows shape information with category partialled out, blue shows category information with shape partialled out. Error bars show SEM across subjects.
We also found that CNNs correlated much more strongly with behavioural shape judgements than with fMRI responses. This finding suggests that there is at least some correspondence between how humans and models use shape, even though there are very likely also differences (see e.g. Baker et al. 19 and Geirhos et al. 21).
Considering the differences in category definitions between the stimulus sets, Set A has only two category clusters defined by the animate-inanimate division, whereas Set B has six object clusters. The number of groups clearly affects correlation levels between the category models and both CNNs and the brain, with fewer groupings boosting the signal. In the final layer of all CNNs, category as defined by animacy in Set A reaches correlation levels up to three times the magnitude of Set B's. In the brain data, category as defined by animacy in Set A reaches six times the magnitude in VTC ant compared with Set B. However, when the contributions of category and shape are analysed separately, this difference between Sets A and B in category information disappears. This is evident in Fig. 6, where the blue line shows a very similar peak in both datasets for all CNNs. The very high category correlations observed in Fig. 5 for Set A were mainly driven by the shape component, and once this is dissociated the results across the two stimulus sets largely correspond (Fig. 6). We find that in all four networks, human similarity judgements of category are best explained by the final layer of CNNs, more so than by fMRI representations in late ventral areas.
It may seem striking that in some cases the correlations for shape remain higher than those for category, even at the final layer (see Fig. 2, CaffeNet for both sets and ResNet50 for Set A). The shape model itself is a behavioural model, which captures high-level properties of shape that are presumably still represented at the final layers of CNNs. If the shape model were instead defined using basic low-level shape features, we would expect more of a decrease in the final layers of CNNs.
Our use of multiple CNNs allows us to observe the influence of network depth on peak correlations with brain regions. Hong et al. 9 compared their brain data to a CNN consisting of 6 parallelised convolutional layers, finding that the model's top hidden layer was most predictive of IT response patterns and that lower layers bore a higher resemblance to V1-like Gabor patterns. Consistent with their findings, we also found that the final layer of CNNs had maximum correspondence with later ventral stream areas, and that earlier layers showed higher correlation with V1. Cichy et al. 14 found peak V1 correlations in the second layer of an 8-layer CNN trained for object recognition. Similarly, in our experiments we found that peak V1 correlations occurred at layer 3 of an 8-layer network (CaffeNet) for both stimulus sets. As network depth increases, peak correlations with V1 shift from earlier tiers in the network to later layers. Interestingly, some of the highest V1 correlations occur immediately prior to fully connected layers, as is the case in ResNet50 and VGG-19. Figure 5 illustrates peak V1 correlations occurring as late as the 45th layer in ResNet50, bringing into question the explanatory value of additional processing stages in deeper networks, especially when an 8-layer network achieves similar magnitudes of correlation with V1 by the third layer. Nevertheless, while the maximum correlation values of brain regions shift to later layers in larger networks, the rank order of correlation peaks with brain regions still matches the order of information processing along the ventral pathway. That is, correlations with V1 always peak before VTC ant, regardless of network depth. We extend the finding of Cichy et al. 14 that early visual area correlations peak before later ones from a single 8-layer network to multiple networks, including a 50-layer network.
In conclusion, despite shape and category often being confounded in natural images, and the possibility for artificial neural networks to exploit this correlation when performing classification tasks, we find that deep convolutional neural networks are able to represent category information independently from low-level shape in a manner similar to higher level visual cortex in humans.

Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.