Evidence that recurrent circuits are critical to the ventral stream’s execution of core object recognition behavior


Non-recurrent deep convolutional neural networks (CNNs) are currently the best at modeling core object recognition, a behavior that is supported by the densely recurrent primate ventral stream, culminating in the inferior temporal (IT) cortex. If recurrence is critical to this behavior, then primates should outperform feedforward-only deep CNNs for images that require additional recurrent processing beyond the feedforward IT response. Here we first used behavioral methods to discover hundreds of these ‘challenge’ images. Second, using large-scale electrophysiology, we observed that behaviorally sufficient object identity solutions emerged ~30 ms later in the IT cortex for challenge images compared with primate performance-matched ‘control’ images. Third, these behaviorally critical late-phase IT response patterns were poorly predicted by feedforward deep CNN activations. Notably, very-deep CNNs and shallower recurrent CNNs better predicted these late IT responses, suggesting that there is a functional equivalence between additional nonlinear transformations and recurrence. Beyond arguing that recurrent circuits are critical for rapid object identification, our results provide strong constraints for future recurrent model development.

Fig. 1: Behavioral screening and identification of control and challenge images.
Fig. 2: Large-scale multiunit array recordings in the macaque IT cortex.
Fig. 3: Relationship between OSTs and neural response latencies.
Fig. 4: Predicting IT neural responses with DCNN features.
Fig. 5: Comparison of backward visual masking between challenge and control images.
Fig. 6: Comparison of OST prediction strength.

Data availability

The images used in this study and the behavioral and object solution time data will be publicly available at the time of publication from our GitHub repository (https://github.com/kohitij-kar/image_metrics).

Code availability

The code to generate the associated figures will be available upon reasonable request. The images, primate behavioral scores, estimated object solution times, and the modeling results will be hosted at http://brain-score.org (ref. 29).


References

1. DiCarlo, J. J., Zoccolan, D. & Rust, N. C. How does the brain solve visual object recognition? Neuron 73, 415–434 (2012).
2. Riesenhuber, M. & Poggio, T. Models of object recognition. Nat. Neurosci. 3, 1199–1204 (2000).
3. Yamins, D. L. & DiCarlo, J. J. Eight open questions in the computational modeling of higher sensory cortex. Curr. Opin. Neurobiol. 37, 114–120 (2016).
4. Hung, C. P., Kreiman, G., Poggio, T. & DiCarlo, J. J. Fast readout of object identity from macaque inferior temporal cortex. Science 310, 863–866 (2005).
5. Majaj, N. J., Hong, H., Solomon, E. A. & DiCarlo, J. J. Simple learned weighted sums of inferior temporal neuronal firing rates accurately predict human core object recognition performance. J. Neurosci. 35, 13402–13418 (2015).
6. Cadieu, C. F. et al. Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Comput. Biol. 10, e1003963 (2014).
7. Yamins, D. L. et al. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl Acad. Sci. USA 111, 8619–8624 (2014).
8. Guclu, U. & van Gerven, M. A. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. J. Neurosci. 35, 10005–10014 (2015).
9. Rajalingham, R., Schmidt, K. & DiCarlo, J. J. Comparison of object recognition behavior in human and monkey. J. Neurosci. 35, 12127–12136 (2015).
10. Rajalingham, R. et al. Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. J. Neurosci. 38, 7255–7269 (2018).
11. Rockland, K. S. & Virga, A. Terminal arbors of individual “feedback” axons projecting from area V2 to V1 in the macaque monkey: a study using immunohistochemistry of anterogradely transported Phaseolus vulgaris-leucoagglutinin. J. Comp. Neurol. 285, 54–72 (1989).
12. Felleman, D. J. & Van Essen, D. C. Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex 1, 1–47 (1991).
13. Rockland, K. S., Saleem, K. S. & Tanaka, K. Divergent feedback connections from areas V4 and TEO in the macaque. Vis. Neurosci. 11, 579–600 (1994).
14. Rockland, K. S. & Van Hoesen, G. W. Direct temporal–occipital feedback connections to striate cortex (V1) in the macaque monkey. Cereb. Cortex 4, 300–313 (1994).
15. Thorpe, S., Fize, D. & Marlot, C. Speed of processing in the human visual system. Nature 381, 520–522 (1996).
16. Hinton, G. E., Dayan, P., Frey, B. J. & Neal, R. M. The “wake–sleep” algorithm for unsupervised neural networks. Science 268, 1158–1161 (1995).
17. Geirhos, R. et al. Comparing deep neural networks against humans: object recognition when the signal gets weaker. Preprint at arXiv https://arxiv.org/abs/1706.06969 (2017).
18. Lamme, V. A. & Roelfsema, P. R. The distinct modes of vision offered by feedforward and recurrent processing. Trends Neurosci. 23, 571–579 (2000).
19. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Proc. 25th International Conference on Neural Information Processing Systems, Volume 1 (eds Pereira, F., Burges, C. J. C., Bottou, L. & Weinberger, K. Q.) 1097–1105 (Curran Associates, 2012).
20. Lin, T.-Y. et al. Microsoft COCO: Common objects in context. In Proc. 13th European Conference on Computer Vision (eds Fleet, D., Pajdla, T., Schiele, B. & Tuytelaars, T.) 740–755 (Springer, 2014).
21. Meyers, E. M., Freedman, D. J., Kreiman, G., Miller, E. K. & Poggio, T. Dynamic population coding of category information in inferior temporal and prefrontal cortex. J. Neurophysiol. 100, 1407–1419 (2008).
22. Oram, M. W. Contrast induced changes in response latency depend on stimulus specificity. J. Physiol. Paris 104, 167–175 (2010).
23. Rolls, E. T., Baylis, G. C. & Leonard, C. M. Role of low and high spatial frequencies in the face-selective responses of neurons in the cortex in the superior temporal sulcus in the monkey. Vision Res. 25, 1021–1035 (1985).
24. Op De Beeck, H. & Vogels, R. Spatial sensitivity of macaque inferior temporal neurons. J. Comp. Neurol. 426, 505–518 (2000).
25. Willenbockel, V. et al. Controlling low-level image properties: the SHINE toolbox. Behav. Res. Methods 42, 671–684 (2010).
26. McKee, J. L., Riesenhuber, M., Miller, E. K. & Freedman, D. J. Task dependence of visual and category representations in prefrontal and inferior temporal cortices. J. Neurosci. 34, 16065–16075 (2014).
27. Bugatus, L., Weiner, K. S. & Grill-Spector, K. Task alters category representations in prefrontal but not high-level visual cortex. Neuroimage 155, 437–449 (2017).
28. Liao, Q. & Poggio, T. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. Preprint at arXiv https://arxiv.org/abs/1604.03640 (2016).
29. Schrimpf, M. et al. Brain-score: which artificial neural network for object recognition is most brain-like? Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/407007v1 (2018).
30. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. In Proc. 29th IEEE Conference on Computer Vision and Pattern Recognition (ed. IEEE Computer Society) 2818–2826 (IEEE Computer Society, 2016).
31. Szegedy, C., Ioffe, S., Vanhoucke, V. & Alemi, A. A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proc. 31st AAAI Conference on Artificial Intelligence (ed. AAAI) 4278–4284 (AAAI, 2017).
32. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. 29th IEEE Conference on Computer Vision and Pattern Recognition (ed. IEEE Computer Society) 770–778 (IEEE Computer Society, 2016).
33. Russakovsky, O. et al. Imagenet large scale visual recognition challenge. Int. J. Comp. Vision 115, 211–252 (2015).
34. Kubilius, J. et al. CORnet: modeling the neural mechanisms of core object recognition. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/408385v1 (2018).
35. Stojanoski, B. & Cusack, R. Time to wave good-bye to phase scrambling: creating controlled scrambled images using diffeomorphic transformations. J. Vis. 14, 6 (2014).
36. Fahrenfort, J. J., Scholte, H. S. & Lamme, V. A. Masking disrupts reentrant processing in human visual cortex. J. Cogn. Neurosci. 19, 1488–1497 (2007).
37. Elsayed, G. F. et al. Adversarial examples that fool both human and computer vision. Preprint at arXiv https://arxiv.org/abs/1802.08195 (2018).
38. Spoerer, C. J., McClure, P. & Kriegeskorte, N. Recurrent convolutional neural networks: a better model of biological object recognition. Front. Psychol. 8, 1551 (2017).
39. Tang, H. et al. Recurrent computations for visual pattern completion. Proc. Natl Acad. Sci. USA 115, 8835–8840 (2018).
40. Walther, D., Rutishauser, U., Koch, C. & Perona, P. Selective visual attention enables learning and recognition of multiple objects in cluttered scenes. Comp. Vis. Image Und. 100, 41–63 (2005).
41. Bichot, N. P., Heard, M. T., DeGennaro, E. M. & Desimone, R. A source for feature-based attention in the prefrontal cortex. Neuron 88, 832–844 (2015).
42. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. Preprint at arXiv https://arxiv.org/abs/1409.1556 (2014).
43. Jeurissen, D., Self, M. W. & Roelfsema, P. R. Serial grouping of 2D-image regions with object-based attention in humans. eLife 5, e14320 (2016).
44. Tovee, M. J. Neuronal processing. How fast is the speed of thought? Curr. Biol. 4, 1125–1127 (1994).
45. van Kerkoerle, T. et al. Alpha and gamma oscillations characterize feedback and feedforward processing in monkey visual cortex. Proc. Natl Acad. Sci. USA 111, 14332–14341 (2014).
46. Fyall, A. M., El-Shamayleh, Y., Choi, H., Shea-Brown, E. & Pasupathy, A. Dynamic representation of partially occluded objects in primate prefrontal and visual cortex. eLife 6, e25784 (2017).
47. Tomita, H., Ohbayashi, M., Nakahara, K., Hasegawa, I. & Miyashita, Y. Top-down signal from prefrontal cortex in executive control of memory retrieval. Nature 401, 699–703 (1999).
48. Bar, M. et al. Top-down facilitation of visual recognition. Proc. Natl Acad. Sci. USA 103, 449–454 (2006).
49. Seger, C. A. How do the basal ganglia contribute to categorization? Their roles in generalization, response selection, and learning via feedback. Neurosci. Biobehav. Rev. 32, 265–278 (2008).
50. Chatfield, K., Simonyan, K., Vedaldi, A. & Zisserman, A. Return of the devil in the details: Delving deep into convolutional nets. Preprint at arXiv https://arxiv.org/abs/1405.3531 (2014).
51. Santos, A. et al. Evaluation of autofocus functions in molecular cytogenetic analysis. J. Microsc. 188, 264–272 (1997).
52. Rosenholtz, R., Li, Y. & Nakano, L. Measuring visual clutter. J. Vis. 7, 11–22 (2007).
53. Baker, C. I., Behrmann, M. & Olson, C. R. Impact of learning on representation of parts and wholes in monkey inferotemporal cortex. Nat. Neurosci. 5, 1210–1216 (2002).



Acknowledgements

This research was primarily supported by the Office of Naval Research MURI-114407 (to J.J.D.) and in part by US National Eye Institute grants R01-EY014970 (to J.J.D.) and K99-EY022671 (to E.B.I.), and the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 705498 (to J.K.). This work was also supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. The authors thank A. Afraz for his surgical assistance.

Author information

K.K. and J.J.D. designed the experiments. K.K., K.S., and E.B.I. carried out the experiments. K.K. performed the data analyses. K.K. and J.K. performed computational modeling. K.K. and J.J.D. wrote the manuscript.

Correspondence to Kohitij Kar.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Journal peer review information: Nature Neuroscience thanks Blake Richards, Pieter Roelfsema, and other anonymous reviewer(s) for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Types of images used, performances across different shallower DCNNs, and comparison of models with humans.

a) Examples of different image types used in the behavioral testing. Image types included synthetic images containing an object in an uncorrelated background, images with blur, small object sizes, occlusion, incomplete objects, deformed objects, cluttered scenes, fused objects, and natural photographs. b) Comparison of pooled monkey behavioral performance and three DCNN models with similar architectures: VGG-S, NYU, and AlexNet. Each bar corresponds to an image. Red bars indicate the challenge images. The black dashed line shows the threshold difference (set at 1.5) used to determine the challenge images. c) Comparison of human performance (data pooled across 88 human subjects) and DCNN performance (AlexNet; ‘fc7’). Each dot represents the behavioral task performance (I1; see Methods) for a single image. We reliably identified challenge (red dots; n = 266 images) and control (blue dots; n = 149 images) images. Error bars are bootstrapped s.e.m. over 1000 resamples of n = 88 trials per image.
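The thresholding step described in panel b can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `find_challenge_images`, the use of `>=` at the boundary, and the toy scores are all assumptions made for the example. Only the threshold value of 1.5 comes from the caption.

```python
import numpy as np

def find_challenge_images(primate_i1, model_i1, threshold=1.5):
    """Return indices of images where primate performance exceeds model
    performance by at least `threshold` (hypothetical helper)."""
    primate_i1 = np.asarray(primate_i1, dtype=float)
    model_i1 = np.asarray(model_i1, dtype=float)
    diff = primate_i1 - model_i1          # per-image performance gap
    return np.flatnonzero(diff >= threshold)

# Toy, made-up per-image scores (not data from the study).
primate = [3.0, 2.5, 1.0, 4.0]
model = [1.0, 2.4, 1.2, 2.0]
challenge_idx = find_challenge_images(primate, model)
print(challenge_idx)
```

In this toy input, only the first and last images have a gap of at least 1.5, so they are the ones flagged.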

Supplementary Figure 2 Object by object comparison of pooled monkey performance (data pooled across 2 monkeys) and DCNN performance (AlexNet; ‘fc7’).

Each dot represents the behavioral task performance (I1; see Methods) for a single image of the corresponding object. We reliably identified challenge (red dots) and control (blue dots) images. Error bars are bootstrapped s.e.m. across 1000 resamples for 123 trials per image. n = 132 images per object (corresponding to each sub-panel).
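A bootstrapped s.e.m. of the kind quoted for the error bars can be sketched as below. The sketch assumes a simple per-image fraction correct as the statistic (a stand-in for I1, whose definition is in the Methods and is not reproduced here). The function name `bootstrap_sem` and the simulated trials are hypothetical; the 1000 resamples and 123 trials per image are taken from the caption.

```python
import numpy as np

def bootstrap_sem(trial_outcomes, n_resamples=1000, seed=0):
    """Standard deviation of the mean across bootstrap resamples of trials."""
    rng = np.random.default_rng(seed)
    trials = np.asarray(trial_outcomes, dtype=float)
    # Each row holds one resample: trial indices drawn with replacement.
    idx = rng.integers(0, trials.size, size=(n_resamples, trials.size))
    resampled_means = trials[idx].mean(axis=1)
    return resampled_means.std(ddof=1)

# Simulated correct/incorrect outcomes for one image (not real data).
rng = np.random.default_rng(1)
outcomes = rng.random(123) < 0.8
sem = bootstrap_sem(outcomes)
print(round(float(sem), 3))
```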

Supplementary Figure 3 Challenge image and object solution time estimation done separately for the MS COCO images.

a) Comparison of AlexNet (‘fc7’) performance and pooled monkey behavior on the MS COCO images (n = 200; 47 control and 38 challenge images). Error bars show the s.e.m. across 1000 resamples from 123 trials per image. b) Distribution of challenge (red) and control (blue) image OSTs. ΔOST was estimated at ~33 ms.

Supplementary Figure 4 Comparison of control and challenge image performance, both behavioral and neural decoding accuracy, during repeated exposures of images and for the first three trials, respectively.

a) Change in pooled monkey behavioral performance (I1) with repeated exposure to the control (blue) and challenge (red) images. Each data point was estimated by pooling together 10 trials (around the trial numbers indicated on the x-axis). The control and challenge images did not show different learning curves over time after they were introduced during testing. Error bars are s.e.m. across images. b) IT decode accuracies over time for control (blue) and challenge (red) images, estimated for the first 3 trials per image only. This shows that the lagged solutions for the challenge images exist from the very early exposure periods of the images during behavioral testing and are not a result of changes in IT responses due to repeated exposure (or some form of reinforcement learning). The dashed line at d’ = 1 was used as a threshold to approximate the difference in decoder latencies between these two image sets. Error bars are s.e.m. across images.
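One way to read a decode latency off a d’-over-time curve using a fixed threshold, as panel b describes, is sketched below. The linear interpolation between the two bins that straddle the threshold is an assumption of this sketch, as are the function name `threshold_latency` and the toy curves. The caption specifies only the threshold (d’ = 1).

```python
import numpy as np

def threshold_latency(times_ms, dprime, threshold=1.0):
    """First time at which the decode curve reaches `threshold`,
    linearly interpolated between adjacent time bins."""
    times_ms = np.asarray(times_ms, dtype=float)
    dprime = np.asarray(dprime, dtype=float)
    above = np.flatnonzero(dprime >= threshold)
    if above.size == 0:
        return float("nan")               # threshold never reached
    i = above[0]
    if i == 0:
        return float(times_ms[0])         # already above at the first bin
    t0, t1 = times_ms[i - 1], times_ms[i]
    d0, d1 = dprime[i - 1], dprime[i]
    return float(t0 + (threshold - d0) * (t1 - t0) / (d1 - d0))

# Toy decode curves (made-up values) sampled every 10 ms.
times = [90, 100, 110, 120]
control_curve = [0.0, 0.5, 1.5, 2.0]
challenge_curve = [0.0, 0.2, 0.6, 1.4]
lag_ms = threshold_latency(times, challenge_curve) - threshold_latency(times, control_curve)
print(lag_ms)
```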

Supplementary Figure 5 Estimating how good the decoding accuracies are when trained and tested at different times.

a, b) Temporal cross-training matrices for control (n = 149) and challenge (n = 266) images, respectively. To estimate the value at each element of the matrix, we trained an IT neural population (n = 424) decoder (see Methods) at time ‘t1’ ms and tested it at time ‘t2’ ms. c) The color denotes the percentage difference in performance from the diagonal (that is, when the decoder was trained and tested at the same time point; therefore, all diagonal values are zero). This is similar to the classification endurance (CE) metric used by Salti et al. 2015. We observed a lack of generalization across the train and test times. For instance, a closer inspection of c) (green dotted rectangle) reveals that decoders trained at, for example, 110–120 ms (average OST of control images) lose more than 50% of their decoding accuracy (green *) when tested at >140 ms (average OST of challenge images). This suggests that object information is coded by a dynamic population code, consistent with the entry of recurrent inputs during late phases of the IT response.
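The train-at-t1, test-at-t2 procedure and the percentage change from the diagonal can be sketched as follows. The decoder used in the study is described in the Methods. This sketch swaps in a simple nearest-class-mean classifier, normalizes each row by its own diagonal entry, and uses simulated responses, all of which are assumptions made for illustration.

```python
import numpy as np

def nearest_mean_fit(X, y):
    """Fit a nearest-class-mean classifier (stand-in decoder)."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, means

def nearest_mean_predict(model, X):
    classes, means = model
    dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

def temporal_cross_training(train_R, train_y, test_R, test_y):
    """acc[t1, t2]: accuracy when trained at time bin t1 and tested at t2.
    Response arrays have shape (n_images, n_sites, n_times)."""
    n_times = train_R.shape[2]
    acc = np.zeros((n_times, n_times))
    for t1 in range(n_times):
        model = nearest_mean_fit(train_R[:, :, t1], train_y)
        for t2 in range(n_times):
            pred = nearest_mean_predict(model, test_R[:, :, t2])
            acc[t1, t2] = (pred == test_y).mean()
    return acc

def percent_change_from_diagonal(acc):
    """Percentage change of each row relative to its same-time accuracy."""
    diag = np.diag(acc)[:, None]
    return 100.0 * (acc - diag) / diag

# Simulated dynamic code: the class pattern differs in every time bin.
rng = np.random.default_rng(0)
n_train, n_test, n_sites, n_times = 40, 40, 20, 3
patterns = 2.0 * rng.normal(size=(2, n_sites, n_times))
train_y = np.arange(n_train) % 2
test_y = np.arange(n_test) % 2
train_R = patterns[train_y] + rng.normal(size=(n_train, n_sites, n_times))
test_R = patterns[test_y] + rng.normal(size=(n_test, n_sites, n_times))

acc = temporal_cross_training(train_R, train_y, test_R, test_y)
pct_change = percent_change_from_diagonal(acc)
print(np.round(pct_change, 1))
```

Because the simulated class patterns are independent across time bins, off-diagonal entries fall toward chance, which mimics the lack of generalization described in the caption.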

Supplementary Figure 6 Controls analyses to rule out alternative hypotheses.

a) Dependence of OST on the pooled monkey I1 level. The red and blue curves show the OST values averaged across images with behavioral I1 accuracy within the limits shown on the x-axis, for challenge (n = 67, 145, 42, 12 images for each x-value) and control (n = 54, 44, 41, 10 images for each x-value) images, respectively. Error bars are s.e.m. across images. b) Comparison of the onset latencies (t_onset) per neuron (n = 424 neurons) between the 266 challenge (y-axis) and 149 control (x-axis) images, averaged across images of each group. Horizontal and vertical error bars denote s.e.m. across images. c) Examples of two images before and after the SHINE (spectrum, histogram, and intensity normalization and equalization; ref. 25) algorithm was applied. d) Average IT population decodes over time after the SHINE technique was applied, for the control (blue) and challenge (red) images. The error bars denote s.e.m. across images. The black line indicates the average behavioral I1 for the pooled monkey population across all images. The gray shaded region indicates the standard deviation of the behavioral I1 for the pooled monkey population across all images. The inset shows a comparison of the average normalized firing rates (across 424 neurons) over time for both challenge (n = 266 images; red) and control (n = 149 images; blue) images after SHINE processing. Error bars indicate s.e.m. across images.

Supplementary Figure 7 Comparison of latencies in control and challenge image-evoked neural responses in area V4.

The top panel shows the placement of chronic Utah array implants in IT and V4 of two monkeys. Below it, we show the time course of normalized neural firing rates (averaged across the V4 population of 151 sites) for control (n = 149 images; blue) and challenge (n = 266 images; red) images. Shaded error regions indicate s.e.m. across neurons (n = 151). The distributions of average onset latencies across the control (blue) and challenge (red) images are shown in the two bottom panels, respectively. These two distributions are not significantly different.

Supplementary Figure 8 Testing the dependence of the decoding lags on category selectivity of neurons and image properties.

We considered the possibility that the difference in the OST between control and challenge images for each object category is primarily driven by neurons that specifically prefer that category (object-relevant neurons; the number for each category is shown in b). To address this, we first asked whether the object-relevant neurons show a significant difference in response latency (that is, Δt_onset (challenge − control image) > 0) when measured for their preferred object category. a) Four example object categories and the dependence of Δt_onset (Δ onset latency, ms: challenge − control) on neuronal object selectivity. The Spearman correlation value, R, and associated p-values are shown as insets. The top panel of c) summarizes these examples and shows that the overall Δt_onset was not significantly greater than zero (unpaired t-test; p > 0.5). In fact, a closer inspection (top panel of c) reveals that for some objects (for example, bear, elephant, dog) Δt_onset was actually negative, that is, there was a trend toward slightly shorter response latencies for challenge images. Finally, to test the possibility that there was an overall trend for the most selective neurons to show a significant Δt_onset, we computed the correlation between Δt_onset and the individual object selectivity per neuron, per object category, as indicated in a). The bottom panel of c) shows that the response latency differences did not depend on the object selectivity of each neuron. In sum, the later mean OST for challenge images cannot be simply explained by longer response latencies in the IT neurons that ‘care’ about the object categories. d–h) Dependence of object solution times on different image-based factors, tested separately for control and challenge images. Panels d–h show the factors clutter, blur, contrast, size, and eccentricity, respectively. Despite some overall dependence of OST on one or more of these factors, ΔOST (challenge − control) is maintained at ~30 ms at each tested level of these factors. The dashed lines show a linear fit of the data.

Supplementary Figure 9 Results from the passive fixation task.

a) Comparison of normalized firing rate responses (averaged across all 424 IT sites) to the control (n = 149 images; blue) and challenge (n = 266 images; red) images. The initial dip in the firing rate is caused by the offset responses related to the previous stimulus. The gray bar shows the time bins used for the comparison of challenge versus control image responses reported in the manuscript. b) Estimates of neural decodes over time. Each thin line represents a single control (blue) or challenge (red) image. The thick blue and red lines represent the average control and challenge image decodes over time, respectively. The horizontal dashed line represents the average performance across control and challenge images (the gray area is the standard deviation across images). This demonstrates the lagged solution times for the challenge images. c) Drop in IT predictivity as a function of object solution time. Error bars show s.e.m. across 424 IT sites.

Supplementary Figure 10 Comparison of neural decodes over time between trained and untrained monkey IT cortex during the passive viewing and active discrimination tasks.

a) Results from untrained monkeys: IT population decodes over time for control (blue curve; 86 images) and challenge (red curve; 117 images) images. The threshold to estimate the decode latency, denoted by the dashed black line, was set at 1.8. The recordings were done from 168 sites (see ref. 6). b) Results from trained monkeys during the passive viewing task: IT population decodes over time for control (blue curve) and challenge (red curve) images. The threshold to estimate the decode latency, denoted by the dashed black line, was set at 1.8. The recordings were subsampled randomly to 168 sites (out of 424; however, the selection was restricted to the left hemisphere and the pIT and cIT arrays). c) Results from trained monkeys during active object discrimination tasks: IT population decodes over time for control (blue curve) and challenge (red curve) images. The threshold to estimate the decode latency, denoted by the dashed black line, was set at 1.8. The recordings were subsampled randomly to 168 sites (out of 424; however, the selection was restricted to the left hemisphere and the pIT and cIT arrays). For a–c, we plot the median accuracy across all tested images for each time bin. All error bars are s.e.m. across images (n = 117 for challenge images, n = 86 for control images).

Supplementary Figure 11 Predicting IT neural responses with DCNN features.

a) Schematic of the DCNN neural fitting and prediction testing procedure, which includes three main steps. Data collection: neural responses are collected for each of the 1320 images (50 repetitions) across 10 ms time bins; the example shown is neural site #3. Mapping: we divide the images and the corresponding neural features into a 50-50 train-test split. For the train images, we compute the image-evoked activations (F_TRAIN) of the DCNN model from a specific layer. We then use partial least squares regression to estimate the set of weights (w) and biases (β) that allows us to best predict the train-image neural features (R_TRAIN) from F_TRAIN. Test predictions: once we have the best set of weights (w) and biases (β) that linearly map the model features onto the neural responses, we generate the predictions (M_PRED) from this synthetic neuron for the test-image-evoked activations of the model (F_TEST). We then compare these predictions with the test-image-evoked neural features (R_TEST) to compute the IT predictivity of the model. b) Scatterplots of IT (n = 424 neurons) predictivity (% EV) of different deep, deeper, and deep-recurrent CNNs with respect to AlexNet for images (n = 319) that are solved between 150 and 250 ms post onset. We observe that the IT predictivity of deep CNNs is not significantly different from that of AlexNet. However, both the deeper CNNs and the late passes of CORnet (a deep-recurrent CNN) are better at IT predictivity than AlexNet.
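The mapping and test-prediction steps in panel a can be sketched for a single neural site as follows. To keep the example dependency-free, it uses closed-form ridge regression in place of the partial least squares regression named in the caption. The functions `fit_linear_map` and `percent_explained_variance`, the ridge penalty, the uncorrected squared correlation used as % EV, and the simulated features and responses are all assumptions of this sketch.

```python
import numpy as np

def fit_linear_map(F_train, R_train, ridge=1.0):
    """Estimate weights w and bias beta mapping model features to one
    site's responses (ridge regression standing in for PLS)."""
    f_mean = F_train.mean(axis=0)
    r_mean = R_train.mean(axis=0)
    Fc = F_train - f_mean
    Rc = R_train - r_mean
    A = Fc.T @ Fc + ridge * np.eye(Fc.shape[1])
    w = np.linalg.solve(A, Fc.T @ Rc)
    beta = r_mean - f_mean @ w
    return w, beta

def percent_explained_variance(pred, measured):
    """Squared Pearson correlation in percent (no noise correction)."""
    r = np.corrcoef(pred, measured)[0, 1]
    return 100.0 * r ** 2

# Simulated model features and one site's responses (not real data).
rng = np.random.default_rng(0)
n_images, n_features = 200, 10
F = rng.normal(size=(n_images, n_features))
true_w = rng.normal(size=n_features)
R = F @ true_w + 0.5 + 0.3 * rng.normal(size=n_images)

# 50-50 train-test split over images.
perm = rng.permutation(n_images)
train_idx, test_idx = perm[:100], perm[100:]

w, beta = fit_linear_map(F[train_idx], R[train_idx])
M_pred = F[test_idx] @ w + beta            # predictions for held-out images
ev = percent_explained_variance(M_pred, R[test_idx])
print(round(float(ev), 1))
```

Evaluating only on held-out images, as here, is what keeps the predictivity estimate from being inflated by overfitting to the training half.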

Supplementary Figure 12 Comparison of internal consistency (reliability) of the IT neural responses across time with respect to other variables.

a) Reliability (or internal consistency) of neural responses as a function of time. The internal consistency was computed as a Spearman-Brown corrected correlation between two split halves (trial based) of each IT neural site’s responses across all tested images. Error bars indicate s.e.m. across neurons (n = 424 neurons). b) Normalized averaged population firing rate across time. Vertical dashed lines indicate onset and peak response latency. c) Temporal profile of IT predictivity. d) Object solution time distribution for challenge (red) and control (blue) images. Error bars in c) show s.e.m. across neural sites (n = 424 sites). b), c) and d) are identical to Fig. 3a, Fig. 4a, and Fig. 2c, respectively.
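The Spearman-Brown corrected split-half correlation in panel a follows the standard formula r_SB = 2r / (1 + r), where r is the correlation between the two half-split trial averages. A minimal sketch for one site at one time bin is given below. The random trial split, the function name `split_half_reliability`, and the simulated responses are assumptions of this example.

```python
import numpy as np

def split_half_reliability(responses, seed=0):
    """Spearman-Brown corrected split-half correlation.
    `responses` has shape (n_images, n_trials) for one site and time bin."""
    rng = np.random.default_rng(seed)
    n_trials = responses.shape[1]
    order = rng.permutation(n_trials)      # random trial-based split
    half_a = responses[:, order[: n_trials // 2]].mean(axis=1)
    half_b = responses[:, order[n_trials // 2:]].mean(axis=1)
    r = np.corrcoef(half_a, half_b)[0, 1]  # correlation across images
    return 2.0 * r / (1.0 + r)             # Spearman-Brown correction

# Simulated site: an image-specific signal plus independent trial noise.
rng = np.random.default_rng(1)
signal = rng.normal(size=(100, 1))
responses = signal + rng.normal(size=(100, 50))
reliability = split_half_reliability(responses)
print(round(float(reliability), 2))
```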

Supplementary Figure 13 Evaluation of CORnet IT predictivity.

a) IT predictivity (% EV) computed at early (70–100 ms) response times. We observe that the earlier passes (pass 1 and pass 2) are better predictors of the early time bins and that the prediction deteriorates for the later passes. b) IT predictivity (% EV) computed at late (170–200 ms) phases of IT responses. Here we observe that the late passes (especially pass 4) are better at predicting the IT response than the early passes. Error bars denote s.e.m. across neurons (n = 424).

Supplementary Figure 14 Evaluation of a fine-tuned AlexNet (ImageNet pre-trained).

We first downloaded a version of AlexNet pre-trained on the ImageNet classification dataset. We then cropped the network at the ‘fc7’ layer and added a customized classification layer (containing 10 output nodes, corresponding to our objects) at the back end. We then trained this network end-to-end on a subset of our images (which contained a mixture of both control and challenge images) and tested the fine-tuned network on the remaining held-out images. This process was repeated until all images had been used as held-out test images, yielding a full set of image-by-image cross-validated behavioral accuracies. Although the overall performance of this fine-tuned DCNN was higher than that of the pre-trained (transfer-learned) AlexNet, all of our main findings (presence of challenge images (a), lagged IT decodes (b), and lower IT predictivity (c) for those images; n = 1320 images) were replicated using such a fine-tuned network. Error bars in a) are bootstrapped s.d. for I1 estimates per image. Error bars in c) are s.e.m. across neurons (n = 424).

Supplementary Figure 15 IT neural predictivity (% EV) of AlexNet ‘fc7’ layer tested across time independently for the control (blue) and the challenge (red) images and IT neural predictivity of AlexNet ‘fc7’ layer trained and tested at different time bins (10 ms bins from 90 ms to 200 ms post image onset).

a) The data were divided into 20 ms time bins (from 90 ms to 190 ms). At each time bin, the image-response neural data from a subset of images (sub-sampled from the entire image set) were used to train the mapping between ‘fc7’ activations and the neural response. After training, this model was tested on the responses to the control (n = 149) and challenge (n = 266) images present in the held-out test set. The procedure was repeated to obtain multiple tests for every control and challenge image. The figure shows that both control (blue) and challenge (red) image IT predictivity drops over time. However, the drop is significantly larger for the challenge images (significant interaction between image type and time; F(1,4) = 6.3; p < 0.005; a post hoc Tukey test shows that IT predictivity at time bins > 130 ms is significantly different between control and challenge images). Error bars are s.e.m. across neurons (n = 424). b) The diagonal of this plot (showing the strongest predictivity) corresponds to the cases where the models were trained and tested at the same time bins. Off-diagonal boxes show that IT predictivity gets worse when the models are trained and tested at different time bins. Of note, the strength of IT predictivity drops even along the diagonal (recapturing the phenomenon demonstrated in Fig. 4a).

Supplementary information

Supplementary Figures 1–15.
