## Introduction

To make sense of the surrounding environment, our visual system relies on different transformations of the retinal input1. Just consider Fig. 1A. As any natural scene, this image is defined by a specific content of edges and lines. However, biological vision evolved to disclose the layout of discrete objects, hence the two giraffes in the foreground emerge as salient against the background, and the distinct contents pertaining to edges, shape, texture, and category contribute together to object perception.

Actually, each feature of Fig. 1B–D is processed across the whole visual system. The primary visual cortex (V1) provides an optimal encoding of natural image statistics based on local contrast, orientation and spatial frequencies2,3, and these low-level features significantly correlate with brain activity in higher-level visual areas4,5. Nonetheless, occipital, temporal and parietal modules also process object shape6,7,8,9 and categorical knowledge10,11,12.

Although all these features are relevant to our brain, their relative contribution in producing discrete and coherent percepts has not yet been clarified. In general, these different dimensions are interrelated and share common biases (i.e., are collinear), thus limiting the capability to disentangle their specific role13. For instance, categorical discriminations can be driven either by object shape (e.g., tools have peculiar outlines) or spatial frequencies (e.g., faces and places have specific spectral signatures14:). Consequently, object shape and category are processed by the same regions across the visual cortex, even when using a balanced set of stimuli15. Even so, the combination of multiple feature-based models describes brain object representations better than the same models tested in isolation. For instance, a magnetoencephalography (MEG) study found that combining low-level and semantic features improves the prediction accuracy of brain responses to viewed objects, suggesting that semantic information integrates with visual features during the temporal unfolding of object representations16.

To investigate the spatiotemporal dynamics of object processing, we combined model-based descriptions of pictures, MEG brain activity patterns and a statistical procedure (Relative Weights Analysis; RWA17,) that mitigate the effects of common biases across different dimensions. We ultimately determine the relative contribution across space and time of multiple feature-based representations – i.e., low-level, shape and categorical features - in producing the structure of what we perceive. First, a low-level description of the stimuli was grounded on features extracted by the early visual cortex (i.e., image contrast and spatial frequencies). Second, since shape is critical to interact with the surrounding environment18, we relied on a well-assessed, physiologically-motivated description of shape, i.e., the medial axis19. Finally, objects were also distinctively represented according to their superordinate categories.

To anticipate, we observed fast (100–150 ms) and overlapping representations of low-level properties (contrast and spatial frequencies), shape (medial-axis) and category in posterior sensors. These results may be interpreted as macroscale dynamics resulting in independent parallel processing, and may also suggest a role for shape in the refinement of categorical matching.

## Results

We employed the Relative Weights Analysis16 to reveal the proportional contribution of low-level, shape and category feature models in predicting time resolved representational geometries derived from MEG data, recorded from subjects attending to pictures representing thirty different stimuli from six semantic categories (Fig. 2).

The possible transformations of retinal input were described at three canonical steps of the object processing hierarchy, grounded on previous neurophysiological investigations. A first low-level model was computed by filtering the stimuli with a bank of Gabor filters: this model captures the arrangement of spatial frequencies in a V1-like fashion2. Then, as in previous neuroimaging investigations on the same topic9,20, we described object shape as its medial-axis transform19, that roughly describes an object as its skeleton, with each object part captured by a different branch. And finally, objects were identified by the semantic category they belong to11.

First, we assessed the collinearity between the three models, expressed as the Spearman correlation between the model RDMs (Fig. 2B). The low-level and categorical models have a correlation of ρ = 0.16, the shape model has ρ = 0.09 correlation with the categorical model, and ρ = 0.08 correlation with the low-level one.

Then, RWA was performed within a sensor space searchlight, resulting for each subject in three maps that report the time courses of the metric ε for each sensor, i.e., the proportional contribution of each model across time. RWA controls for model multicollinearity in multiple regression: its metric (ε) does not identify the impact of each model to the prediction of a dependent variable in isolation (i.e., beta weight), as in common multiple linear regressions, but considers also how each model relates to (i.e., is correlated with) the others. Thus, it reflects in a suitable manner the proportional impact of each variable on the prediction of brain activity (Fig. 2C). The single-subject maps were aggregated in group-level z-maps for each model, corrected for multiple comparisons and divided in 50 ms-long time bins for displaying purposes. Only the sensors whose corrected z-values were significant in the entire bin were retained, as displayed in Fig. 3 (black dots mark significant sensors: p < 0.05, rank test, 100,000 permutations, TFCE corrected).

Results show that the model based on low-level features (contrast and spatial frequencies) is significant at early stages after stimulus presentation (0–50 ms) in a cluster of posterior and medial sensors. This cluster expands in the lateral and anterior directions, reaching a maximum in the 100–150 ms interval, when most of the posterior sensors are significant. Shape features are instead restricted to a right posterior location in the 100–150 ms interval, and do not reach significance in the remainder of sensors and time bins. The category-based model is significant in medial and posterior sensors starting at 50–100 ms. The cluster expands to most of the posterior and lateral sensors, with a maximum spatial extent between 100 and 200 ms, then restricting to the posterior and lateral sensors in the 200–250 ms time bin. A cluster of right posterior sensors shows significant weights for the three models in the 100–150 ms time bin only. None of the models was significant in the remaining parts of the time course (before stimulus onset and after 300 ms).

Even if the task was intended to orient subjects’ efforts specifically towards high-level semantic processing, attention towards local features could account for the observed results. To this aim, we compared the responses between semantically similar and dissimilar stimuli and found no significant difference (p > 0.20; see Supplementary Fig. S1). Thus, results are likely not driven by task demand.

## Discussion

The visual machinery is a general-purpose system, relying on different representations that often are collinear or interact to each other. Here, by taking into account model collinearity, we revealed the spatiotemporal dynamics of joint feature processing within the human visual system, to assess the relative contribution of low-level, shape and category features in predicting MEG-based representations. We observed both a temporal and spatial co-occurrence of low-level, shape and categorical processing, early in time (100–150 ms) in posterior sensors. Specifically, we showed that (a) low-level features (i.e., contrast and spatial frequencies) are processed early (0–50 ms) after stimulus onset within posterior MEG sensors, spreading in time from medial to lateral locations; (b) shape coding is limited within a few right posterior sensors in a brief time window (100–150 ms) and co-occurs with low-level and categorical processing; (c) categorical representation emerges later than the onset of low-level processing and is more prolonged, but spreads within a similar pattern of sensors.

Our results demonstrate that within 100–150 ms after stimulus onset, these features are processed concurrently, suggesting that object discrimination may result from independent parallel processing (i.e., orthogonal feature-based descriptions processed with similar temporal dynamics), rather than from a strict feed-forward hierarchy. The observed spatiotemporal overlap is in line with previous neuroimaging evidence showing that category and shape are processed within the same visual regions15,21, and can be decoded in the 130–200 ms time window within the high-level visual cortex, as shown in a combined fMRI-MEG study, which focused on body parts and clothes22. Here we employed a model-based approach which also embedded low-level features, and sampled stimuli from a broader set of categorical classes. In addition, we introduced RWA to overcome multicollinearity, which was not explicitly addressed in previous studies.

Of note, our results raise questions concerning the role of shape in categorization. The synchronization between the three models in our data occurs in a time window (100–150 ms) that overlaps with those of perceptual organization (70–130 ms) and categorical recognition of visual information (>130 ms), as indicated by previous neurophysiological and functional studies in both human and nonhuman primates23,24,25,26,27,28.

Whether shape processing is needed to recognize and classify objects in a scene has not been clarified yet. The classical view that considered shape essential to recognition29 has, however, being challenged by the success of several appearance-based computational models that could perform object recognition by relying on low-level features only30. Since object segmentation occurs during passive natural image viewing31 and controls scene reconstruction25, shape analysis can be similarly triggered by object viewing also in a task for which shape is not explicitly relevant. Thus, our observation has at least two possible explanations: (a) shape processing is to some extent necessary for categorization or, alternatively, (b) it is not, but it is an automatic process occurring even when not overtly required by the task. The former hypothesis may, however, not be consistent with our results that show categorical representations occurring earlier than shape-based representations. In addition, the latter case would be in line with evidence suggesting that the extraction of object affordances – i.e., shape-related features which are able to facilitate or even trigger actions – is a fast and automatic process32,33. However, a conclusion on this topic can be reached only by further studies involving task modulation34. Of note, task is able to influence the strength of object processing late in time (>150 ms35:).

Another interesting result is the early emergence (50–100 ms) of categorical processing within the same pattern of sensors that also encode contrast and spatial frequencies. As mentioned before, object recognition has been described as occurring at 150 ms or later28. We observed category representations within posterior sensors well before (even accounting for the temporal smoothing potentially introduced by the searchlight procedure). Early occurrence of categorical processing has been observed also in previous MEG studies16,35.

In the past years, mounting evidence revealed a top-down control of neurons in the early visual cortex24,36,37,38,39,40. Moreover, in a series of elegant studies25,41,42, Neri found psychophysical evidence of a top-down predictive mechanism, comprising a progressive refinement of local image reconstruction driven by global saliency or semantic content. At the macroscale, the effects of this mechanism imply that both local (i.e., low-level features) and global (i.e., object-related) representations should be retrieved early in time (<150 ms) within the visual cortex. Our results, show early (from 50 to 200 ms), overlapping patterns for low-level and categorical processing in posterior MEG sensors, in line with this view. However, further research is needed to directly test the causal role of top-down feedbacks in controlling low-level processing within the occipital cortex, which falls beyond the original scope of this work.

A further general remark should be made. As mentioned before, multicollinearity is a pervasive property of our surrounding environment. Indeed, one of the most fascinating features of our visual system is the way it deals with correlated statistics within the natural domain, to optimally represent the retinal input43, and to make sense of the external world, through the mean of learning and generalization. Indeed, visual correspondences are the mechanism we used to evolve more abstract, categorical representations44. However, from the researcher perspective, this leads to an extreme effort in balancing dimensions of interest, or in developing orthogonal models. In addition, two further aspects should be considered: first, as shown empirically13, since different stimuli typically vary within multiple dimensions simultaneously, it is almost impossible to isolate a single dimension of interest; second, the effort in building orthogonal competing descriptions increases with the number of tested models.

Several methods have been proposed to overcome models collinearity (for a review, see45:). Within the field of neuroimaging, Lescroart, et al.46 employed a variance partitioning approach (the same method, in the domain of multiple linear regression, is known as commonality analysis – as also employed in the MEG field35), which aims at determining the explained variance for any possible subset of the models. While this analysis is able to estimate the variance unique to each partition, its main drawback is that partitions grow exponentially with the number of models: since there are $${2}^{p}-1$$ subsets for p predictors, just exploring the impact of 5 models generates 31 different subsets. In light of this, even comparing a low number of models would end up in a computationally intensive process and in the challenging task of interpreting and discussing a huge number of sub-models. Moreover, the partitions related to variance shared by different models can occasionally be negative, and the interpretation of these negative components is still matter of debate47. From this perspective, RWA is an attractive alternative, as it estimates the relative, non-negative weight of each model and does not imply to discuss more models or components than those initially considered.

Indeed, relative weights reflect in a suitable manner the proportional impact of each variable on the prediction of brain activity and - if the predictors are standardized - sum up to the total explained variance17. However, some limitations also affect RWA: the most relevant is that estimated weights are not invariant to the orthogonalization procedure employed. Though, it has been proven that, the more the orthogonal variables approximate the original variables, the more reliable the estimated weights become (for a deeper treatment of the topic, see17:). Therefore, RWA may represent a fast and appealing recipe to deal with model multicollinearity within the neuroimaging field, especially when three or more models are compared.

In conclusion, this study reveals the spatiotemporal dynamics of object processing from a model-based perspective, providing evidence in favour of an integrated perceptual mechanism in object representation.

## Methods

### Participants

Sixteen healthy right-handed volunteers (5F, age 27 ± 2) with normal or corrected to normal visual acuity participated in the study. All subjects gave informed consent to the experimental procedures and received a monetary reward for their participation. The study was approved by the Ethics Committee for research involving human participants at the University of Trento, and all the experimental procedures were conducted in accordance with the Declaration of Helsinki.

### Stimuli

Visual stimuli were colour pictures representing thirty different objects from six semantic categories (fruits, vegetables, animals, birds, tools, vehicles). The set of stimuli were used in two previous fMRI studies from our group9,12, and were controlled for psycholinguistic features and familiarity (for details, see12). Stimuli were presented using MATLAB and the Psychophysics Toolbox48, and were projected on a translucent screen placed at about 130 cm from the participant, using a Propixx DLP projector (VPixx technologies), with a refresh rate of 60 Hz and a resolution of 1280 × 1024 pixels (21.7 × 13.16°).

### Task and design

The experiment was organized in eight runs, each consisting of three blocks (see Fig. 2A). In each block, the thirty images were presented in randomized order, and participants were engaged in a semantic judgment task to ensure that they focused the attention on the stimuli49. At the beginning of each block, a binary target question (e.g., “Is it a tool?”) was shown; once subjects read the questions, they prompted the start of the block by pressing a button on a keyboard. Within each block, subjects answered (yes/no) to the question presented at the beginning using the keyboard. All pictures were presented 24 times, with a different target question for each repetition. 5 s-long resting periods preceded and followed each block, and 1 s-long resting periods followed the behavioural response to each stimulus within a block. During the resting periods, subjects had to fixate a black cross, displayed in the centre of the screen. The order of the questions was randomized across participants.

### Models

In order to predict MEG representational geometries, three different descriptions were built, representing different physiologically relevant properties of the objects seen by the subjects (see Fig. 2B). First, a low-level model, which captures the arrangement of spatial frequencies in a V1-like fashion, was employed: a GIST30 descriptor for each stimulus was derived by sampling (in a 4 × 4 grid) the responses to a bank of isotropic Gabor filters (8 orientations and 4 scales). The descriptor (consisting of a vector with 512 elements) of each stimulus was then normalized and compared to each other stimulus using the pairwise correlation distance (1 – Pearson’s r). Second, a shape model was computed. Similarly to previous neuroimaging investigations on the same topic9,20, the medial-axis transform19 was extracted from each manually segmented and binarised object silhouette. Then, shock-graphs skeletal representations were built, and their pairwise dissimilarity was computed using the ShapeMatcher algorithm (http://www.cs.toronto.edu/~dmac/ShapeMatcher/; Van Eede, et al.50), which estimates the minimum deformation needed in order to match two different shapes51. Finally, the thirty stimuli were described based on their semantic category, obtaining a binary categorical model.

### MEG data acquisition

MEG data were recorded using an Elekta VectorView system with 306-channels, 204 first order planar gradiometers and 102 magnetometers (Elekta-Neuromag Ltd., Helsinki, Finland), located in a magnetically shielded room (AK3B, Vakuumschmelze, Hanau, Germany). The sampling rate was 1 kHz. Head shapes were recorded from each participant immediately before the experiment, using a Polhemus Fastrak digitizer (Polhemus, Vermont, USA) recording the position of fiducial points (nasion, pre-auricular points) and around 500 additional points on the scalp. MEG data were synchronized with experiments timing by sending four different triggers at question presentation, first button press (after question), stimulus presentation and stimulus-related behavioural responses (button presses), respectively.

### MEG data pre-processing

MEG data pre-processing was performed using the Fieldtrip toolbox52. First, a bandpass (1–80 Hz) and a notch (50 Hz) 4th order Butterworth IIR filters were applied to the data53. Filtered signals were then cut in epochs from 500 ms before to 1 s after stimulus onset and resampled at 400 Hz. Subsequently, data were visually inspected according to a set of summary statistics (range, variance, maximum absolute amplitude, maximum z-value) to search for trials and channels affected by artefacts, using the procedure for visual artefact identification implemented in Fieldtrip; trials marked as bad were rejected and noisy sensors were reconstructed by interpolating their spatial neighbours. On average, 8% of the trials and 10% of the channels were rejected for each subject.

### Searchlight analysis

A searchlight analysis was performed using CoSMoMVPA54, retaining the MEG data from the gradiometers only. First, the time-locked patterns for the individual trials were reduced to thirty pseudo-trials (one for each stimulus)55. Searchlights were then defined for each time point of the pseudo-trials using a spatial and temporal neighbouring structure56. Each searchlight included 10 dipoles (pairs of combined gradiometers) in the spatial domain, and each time point plus the ten preceding and following it (i.e., 21 time points, 52.5 ms) in the temporal domain. Within each spatiotemporal searchlight, a time-varying representational dissimilarity matrix (RDM) was derived for the MEG data by computing the pairwise correlation distances between pattern of responses to the thirty stimuli57; prior to computing the RDM, stimulus-specific activity patterns were normalized (z-scored).

### Relative weights analysis (RWA)

In order to estimate how well each model RDM was related to MEG representational geometries, a multiple linear regression for each subject and each spatiotemporal searchlight was performed. Since some of the three models RDMs are significantly correlated the Relative Weights Analysis (RWA), introduced by Johnson17, was adopted. The metric on which RWA relies is called epsilon (ε) and reflects both the unique contribution of each model and its impact when all the other models are considered.

The RWA procedure is graphically synthetized in Fig. 2C. Basically, the models RDMs were first orthogonalized, by performing a Principal Component Analysis (PCA), and the RDMs from each spatiotemporal searchlight were regressed on the so obtained orthogonal versions of the models RDMs. Then, the regression coefficients were related back to the original model RDMs by regressing the orthogonal RDMs also on the models RDMs. Finally, for the j-th model, epsilon was calculated as:

$${\varepsilon }_{j}=\sum _{k=1}^{p}{\lambda }_{jk}^{2}{\beta }_{k}^{2}$$

where p is the number of models, $${\beta }_{k}^{2}$$ is the variance (i.e., the squared standardized regression coefficient) in each searchlight RDM accounted for by the k-th orthogonal RDM, and $${\lambda }_{jk}^{2}$$ is the variance in the j-th model accounted for by the k-th orthogonal RDM.

### Statistical analyses

The RWA analysis, performed within the spatiotemporal searchlights as described above, provided a time course of the metric (ε) for each sensor and time point. To estimate the group-level spatiotemporal distribution of weights for each of the three models, a one sample non-parametric test was performed, using a null distribution generated with 100,000 permutations (rank test), as implemented in CoSMoMVPA. Correction for multiple comparisons was made at cluster-level using a threshold-free method (TFCE58,59:). Z-values corresponding to a corrected p-value of 0.05 (one-tailed) were considered significant.