Computational models of category-selective brain regions enable high-throughput tests of selectivity

Cortical regions apparently selective to faces, places, and bodies have provided important evidence for domain-specific theories of human cognition, development, and evolution. But claims of category selectivity are not quantitatively precise and remain vulnerable to empirical refutation. Here we develop artificial neural network-based encoding models that accurately predict the response to novel images in the fusiform face area, parahippocampal place area, and extrastriate body area, outperforming descriptive models and experts. We use these models to subject claims of category selectivity to strong tests, by screening for and synthesizing images predicted to produce high responses. We find that these high-response-predicted images are all unambiguous members of the hypothesized preferred category for each region. These results provide accurate, image-computable encoding models of each category-selective region, strengthen evidence for domain specificity in the brain, and point the way for future research characterizing the functional organization of the brain with unprecedented computational precision.


Reviewer #3 (Remarks to the Author):
This study evaluates the fit between multiple (60) deep neural network models and human hemodynamic responses to natural images in three category-selective visual areas: FFA, PPA & EBA. Four human subjects underwent extensive fMRI scanning, measuring at least 20 repetitions of each of 185 natural images. The authors employed a well-established linear encoding approach: Particular neural network layers were evaluated as potential linear bases for predicting the hemodynamic responses to held-out natural images. The prediction accuracy (i.e., the correlation between predicted and observed responses) was assessed both at the regional average activity level and at the voxel activity level. The generality of the model's predictions was tested within and across subjects. Importantly, the authors demonstrated that the best performing model explains a large portion of the variability of the responses *within* object categories, going beyond plain category-based prediction. The authors compared the accuracy of the best performing model to the judgments of human novices with respect to the category-typicality of the images, as well as to expert predictions of the potential of different images to activate the regions. Last, the authors conducted several in silico experiments using the fitted encoding models. They searched for activating images across large natural image sets, synthesized activating stimuli using a GAN-parameterized activation maximization procedure, and used a masking algorithm to highlight activation-driving regions in the images. The in silico experiments all indicated that the maximally activating stimuli for these regions are respective category-members (e.g., faces for FFA), affirming the conventional view of the category-selectivity of these regions.
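The linear encoding approach summarized above can be sketched in miniature as follows. This is an illustrative sketch with simulated data, not the authors' pipeline: the shapes, the ridge penalty, and the single train/test split (standing in for the paper's 10-fold cross-validation) are all invented.

```python
import numpy as np

# Sketch of a linear encoding model: ridge-regress DNN layer activations
# onto an fROI's mean response, then score held-out images by the
# correlation between predicted and observed responses.
rng = np.random.default_rng(0)
n_images, n_features = 185, 50                  # e.g., 185 stimuli, one layer readout
X = rng.standard_normal((n_images, n_features))            # layer activations
true_w = rng.standard_normal(n_features)
y = X @ true_w + 2.0 * rng.standard_normal(n_images)       # noisy fROI response

def ridge_fit(X, y, lam):
    """Closed-form ridge: w = (X'X + lam*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# One train/test split standing in for 10-fold cross-validation.
train, test = np.arange(0, 150), np.arange(150, n_images)
w = ridge_fit(X[train], y[train], lam=10.0)
pred = X[test] @ w
accuracy = np.corrcoef(pred, y[test])[0, 1]     # prediction accuracy (r)
print(round(accuracy, 3))
```

The same machinery applies per voxel instead of per fROI by fitting one weight vector per voxel.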
Overall, this is an important contribution to the computational modeling of high-level visual responses by deep-neural-network-based models. The study reflects a considerable body of work, with multiple models and well-motivated analyses and controls. Having said this, I have several comments about the manuscript as it is, but I believe that these issues are addressable by some additional work.
Major points 1) Unaccounted-for selection bias. The authors evaluated 60 neural networks and then focused on the result from the best performing one (Resnet-50). While 10-fold cross-validation was employed to eliminate the bias introduced when fitting linear weights to map each model's activations to the brain responses, it is unclear from the methods section whether any measures were applied in order to eliminate the selection bias imposed by picking the best model among the 60 candidates. The concern is that this selection bias might cause exaggerated estimates of the prediction accuracy. Importantly, such a bias might invalidate the comparison between the best model and the human raters. Unlike the models, the accuracy of the human raters (novices and experts alike) was not subject to any maximum-taking operation, and hence the comparison might be unfair. In other words, the manuscript currently compares the best model to the average human rater, and it does so without correcting the bias incurred due to the "winner's curse".
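The "winner's curse" the reviewer describes is easy to demonstrate with a toy simulation (all numbers invented): even when 60 models are equally good, the best-looking noisy estimate overstates how the winner performs on fresh data.

```python
import numpy as np

# Toy winner's-curse simulation: pick the best of 60 models by a noisy
# accuracy estimate, then compare the winning estimate to a fresh,
# independent re-estimate of the same model's accuracy.
rng = np.random.default_rng(1)
n_models, n_sims = 60, 2000
true_acc = np.full(n_models, 0.70)     # all models truly identical
noise_sd = 0.05                        # per-model estimation noise

selected_est, fresh_est = [], []
for _ in range(n_sims):
    est = true_acc + noise_sd * rng.standard_normal(n_models)  # CV estimates
    winner = np.argmax(est)                                    # pick the best
    selected_est.append(est[winner])                           # reported value
    fresh_est.append(true_acc[winner] + noise_sd * rng.standard_normal())

bias = np.mean(selected_est) - np.mean(fresh_est)
print(f"selection inflates the estimate by ~{bias:.3f}")
```

With 60 candidates the inflation is roughly noise_sd times the expected maximum of 60 standard normals (about 2.3), so the reported winner's accuracy overshoots by about 0.1 here even though no model is actually better than any other.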
A second source from which selection bias can creep in is the selection of the best layer in each DNN. Here as well, it is unclear whether the bias introduced by this selection operation was accounted for. To allow for a fair assessment of the models, this bias has to be mitigated by a proper *nested* cross-validation procedure, in which the innermost fold is used to fit the encoding parameters, the intermediate fold to select the best layer (and model), and the outer fold to obtain unbiased prediction accuracy estimates. An alternative, even stronger approach is a completely held-out test dataset. Failing to do so exaggerates the reported accuracy estimate and might bias the model comparison in favor of models with more numerous layers.
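The nested scheme the reviewer proposes can be sketched as follows (simulated data; selecting a ridge penalty stands in for selecting a layer or model — the structure of the loops is the point):

```python
import numpy as np

# Nested cross-validation: inner folds fit and select a setting using only
# the outer-training data; the outer fold only scores the chosen setting.
rng = np.random.default_rng(2)
X = rng.standard_normal((180, 40))
y = X @ rng.standard_normal(40) + rng.standard_normal(180)

def ridge(Xtr, ytr, lam):
    d = Xtr.shape[1]
    return np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(d), Xtr.T @ ytr)

def kfold(n, k):
    idx = np.arange(n)
    return [(np.setdiff1d(idx, f), f) for f in np.array_split(idx, k)]

lambdas = [0.1, 1.0, 10.0, 100.0]      # stand-in for candidate layers/models
outer_scores = []
for outer_tr, outer_te in kfold(len(y), 5):
    # Inner loop: choose the setting using ONLY the outer-training data.
    inner_scores = {lam: [] for lam in lambdas}
    for in_tr, in_te in kfold(len(outer_tr), 5):
        tr, te = outer_tr[in_tr], outer_tr[in_te]
        for lam in lambdas:
            w = ridge(X[tr], y[tr], lam)
            inner_scores[lam].append(np.corrcoef(X[te] @ w, y[te])[0, 1])
    best = max(lambdas, key=lambda lam: np.mean(inner_scores[lam]))
    # Outer fold: score the selected setting on untouched data.
    w = ridge(X[outer_tr], y[outer_tr], best)
    outer_scores.append(np.corrcoef(X[outer_te] @ w, y[outer_te])[0, 1])

unbiased_accuracy = float(np.mean(outer_scores))
print(round(unbiased_accuracy, 3))
```

Because the outer test fold never touches either the fitting or the selection step, the averaged outer score is an unbiased estimate of the selected pipeline's accuracy.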
2) The image set considered in the in silico image screening is very large, but it is limited to a narrow domain: natural images. This constraint means that non-natural images are never considered by this procedure. Even if one of the regions was highly responsive to non-natural, non-category-member images (for a concrete example, consider "contextually-defined faces", Cox, Meyers & Sinha, 2004 Science), this would never be discovered by the screening procedure since it tested only natural images. In principle, model-driven image synthesis (similar to that applied by Bashivan, Kar & DiCarlo, 2019 Science) could have addressed this point. However, since the current work limits the synthesized images to those produced by a strong GAN trained on natural image corpus, the stimulus-synthesis procedure's capacity to form novel, non-object-like activating images is severely limited. I suspect that if the authors had used a less restricted activation maximization procedure, less semantically-sensible activating images would have arisen, potentially revealing incorrect (i.e., adversarial) model predictions. This point touches upon a whole can of worms associated with deep neural networks as vision models (adversarial examples, metamers, and so on). The paper's discussion in its current form does not mention this obvious shortcoming of the models.
To summarize this point, the interpretation of the image screening as affirming the category-selective nature of the regions is not completely substantiated due to the limited domain of the screening images (as well as the lack of empirical testing; see the next point). The control analysis on the "face-selective" AlexNet units is necessary but not sufficient since it addresses the case of responsiveness to non-category-member natural images but not the case of responsiveness to non-natural images. Furthermore, announcing that the current approach is "subjecting claims of category selectivity to their strongest tests to date" (line 21 in the abstract, as well as lines 266-267) neglects the extremely rich literature on modified, jumbled, or otherwise non-natural 'trick' stimuli that reveal non-trivial response properties of these high-level regions.
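The GAN-constrained activation maximization at issue in this point can be sketched in miniature. Everything here is invented for illustration: a linear map plus tanh stands in for the generator, and a linear readout stands in for the fitted encoding model; the point is only that the search ascends the predicted response while staying on the generator's image manifold.

```python
import numpy as np

# Toy activation maximization in a generator's latent space: every candidate
# "image" is tanh(G @ z), so the search is confined to the generator's range.
rng = np.random.default_rng(3)
latent_dim, img_dim = 16, 64
G = rng.standard_normal((img_dim, latent_dim)) / np.sqrt(latent_dim)  # "generator"
w = rng.standard_normal(img_dim)            # stand-in encoding-model weights

def predicted_response(z):
    img = np.tanh(G @ z)                    # bounded stand-in "image"
    return w @ img

z = 0.1 * rng.standard_normal(latent_dim)   # latent initialization
initial = predicted_response(z)
lr, steps = 0.1, 200
for _ in range(steps):
    img = np.tanh(G @ z)
    grad_img = w * (1 - img ** 2)           # d(response)/d(image), chain rule
    z += lr * (G.T @ grad_img)              # gradient ascent in latent space

final = predicted_response(z)
print(initial, "->", final)
```

The reviewer's concern maps directly onto the constraint `img = tanh(G @ z)`: dropping it (ascending pixel space directly) would explore non-natural images but also expose the encoding model to adversarial inputs.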
3) The distinctions between model predictions and empirical findings are somewhat blurred. The in silico experiments conducted on the fitted models are informative and contribute to the literature. In particular, I find Figure 4b particularly novel. However, all of the results from line 269 and onwards are model predictions rather than empirical neuroscientific findings. The authors did not call the subjects back to the scanner to test whether the natural and synthetic maximally activating stimuli predicted by the model are indeed maximally effective as stimuli for human cortices. While the authors explicitly acknowledge the distinction between predictions and results in some places in the paper (e.g., lines 377-378), this distinction is glossed over elsewhere. For example, in lines 311-312, the authors state: "This finding further strengthens the inferences that these regions are indeed selective for faces and places." This deduction does not rely on actual empirical testing.
4) The description of the data analysis procedure is insufficient for evaluation and replication. The authors provide some external references, but these are insufficient for understanding what exactly was done. In particular, I'm concerned with the lack of description of how the best layer within each model was selected (was there nested cross-validation as suggested above?), missing details of the linear fitting (what ridge parameters were considered and how were they selected?) and absent information about the "two-stage linear mapping function". For the latter, the authors cite Bashivan, Kar & DiCarlo 2019, but it is unclear whether the implementation described there, applied to modeling particular neurons rather than voxels, is identical to what was done here.

5) Can the authors provide any form of statistical inference of the differences between the various models? The prediction accuracy (Fig. S3) seems quite similar for the top ten models. Relevant sources of variability that contribute to the ranking uncertainty are the finite sampling of stimuli, the finite (and small) sampling of subjects, and the finite number of repetitions.

6) Reliability estimates/"noise ceilings" should be plotted for each figure that depicts prediction accuracy. The way these estimates were computed should be explicitly stated in the methods section.
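One common recipe for the noise ceiling requested in point 6 is split-half reliability with a Spearman-Brown correction. The sketch below uses simulated repetitions and is not necessarily the authors' exact computation:

```python
import numpy as np

# Split-half noise ceiling: split the repetitions of each image in half,
# correlate the two half-mean response profiles across images, and
# Spearman-Brown correct up to the full number of repetitions.
rng = np.random.default_rng(4)
n_images, n_reps = 185, 20
true_profile = rng.standard_normal(n_images)            # stable per-image responses
trials = true_profile[:, None] + 2.0 * rng.standard_normal((n_images, n_reps))

odd_mean = trials[:, 0::2].mean(axis=1)
even_mean = trials[:, 1::2].mean(axis=1)
r_half = np.corrcoef(odd_mean, even_mean)[0, 1]
noise_ceiling = 2 * r_half / (1 + r_half)               # Spearman-Brown correction
print(round(noise_ceiling, 3))
```

A model's prediction accuracy can then be plotted against (or normalized by) this ceiling, which is exactly what makes captured vs. uncaptured structure distinguishable in the figures.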
7) The manuscript somewhat unfairly downplays previous works on encoding visual responses by deep neural network models. In particular, Wen, Shi, Chen & Liu (2018, Scientific Reports, reference 12 in the manuscript) have used the same Resnet-50 indicated as the best model here and performed quite extensive in silico experiments on category-selectivity. It would serve the readers better to situate the current work in the context of the existing literature. Another highly relevant work (not currently cited) is Eickenberg, Gramfort, Varoquaux, and Thirion, 2017 NeuroImage.

8) Given the well-established infrastructure for sharing fMRI data (the Openneuro platform and the BIDS format) and the high potential utility of the data collected by the current study, the proposed data sharing policy ("upon request") seems to be suboptimal.

Minor points:
9) The number of participants (four) has to be mentioned in the first Results paragraph.

10) Many data points in Figure 3c have no error bars.

11) Are the images in Figures 3c and 3e randomly sampled, best performing, or manually chosen?
12) The image synthesis section does not include details necessary for replication, such as optimizer choice, optimization hyper-parameters, stopping conditions, and GAN latent initialization.

REVIEWER COMMENTS
We are very grateful to the three reviewers for their encouraging and constructive feedback on the paper, which has greatly improved the quality of the manuscript. Our responses to these comments are given below in blue, and in each case we have modified our manuscript to address these concerns (changes made to the paper are marked in red).
We also apologize for the delay in responding to the reviewer comments, caused by a COVID-related crisis in the family.

Reviewer #1 (Remarks to the Author):
The paper, "Computational models of category-selective brain regions enable high-throughput tests of selectivity", is an easy-to-read, clearly argued paper in which the authors present a convincing case that FFA, PPA and EBA are indeed selective for faces, places and body parts. They do this by generating image-computable encoding models of responses to faces, places and bodies using DNNs and BOLD-MRI data. These models show impressive performance in predicting responses out of sample (for both new stimuli and new subjects). Using these image-computable models they show that no black swans are found (i.e., no exemplars of the wrong category) that would challenge the idea of category selectivity for faces, places and bodies. Furthermore, images generated using these models are hyper-exemplars of their categories. The sum of these results makes this a very convincing paper for its, in itself unsurprising, main point (indeed, exemplar specificity). However, in doing so the authors present an approach to determine the selectivity of brain areas in a computational manner, opening up an avenue of possibilities.
However, I do have some (somewhat) more major and minor remarks.
Somewhat more major remarks:

L200-L203, it is not clear to me if, for the voxel models, the mean ROI response (for a stimulus) is subtracted from the voxel responses. This would be necessary to make a claim on the voxel information above, and beyond, the ROI. That a voxel contains some information from the ROI is a statistical necessity that is, in itself, uninformative.

The reviewer's suggestion raises a different but interesting question: do individual voxels represent information distinct from the mean response? We have now performed additional analyses to test this specific question. We find that there is indeed some residual variance in individual voxels even after removing the contribution of the mean response, and that our best model (ResNet50) can predict this residual variance to some degree (see the figure inset here, which shows the median predictive accuracy across voxels after removing the mean responses, averaged across participants). This analysis foreshadows some of our other work in progress, which is to describe the diversity of selectivity observed within these regions (and which requires further closed-loop experimental verification).
To summarize, we have now clarified the motivation for this specific analysis (see excerpt below) and leave the question of possibly distinct selectivities within individual voxels for more rigorous testing and evaluation in future work.

"The previous analyses evaluate the ability of models to predict the pooled response averaged across voxels within each category-selective fROI in an effort to build models that generalize across subjects and directly interface with the bulk of prior experimental literature on these regions. But of course information in these regions is coded by the distributed pattern of response across the neural population. So, how well do these models predict the voxel-wise responses (similar to other modeling studies 15,16,25,[31][32][33] ) in each region?"

In this experiment, we aimed to directly compare descriptive word-based models that are usually employed to describe the function of these regions (e.g., EBA is a 'body-selective' region) with predictive models like DNNs. Specifically, we asked whether our computational models perform any better than the single word-level functional descriptions long used to characterize responses in these regions. One possible version of the word-level hypothesis is that the responses are binary: all body images should predict the same maximum response and all non-body images the same minimum response. But that hypothesis is a bit of a straw man, as it does not capture the graded nature of category membership in cognition and the clearly graded response of each region to distinct images. So, we further compared the performance of our models to a "graded" version of the category selectivity hypothesis. To do this we recruited novices with no knowledge about the FFA etc. and asked them to rate how clearly each category was depicted in each image. We then used these ratings as an operationalization of a graded version of the "word model" hypothesis of category selectivity. Specifically, we asked how well the category membership ratings could predict the observed responses in these fROIs. If novice category ratings predict the responses in each region as well as the models do, that would indicate that the models are simply capturing those "word model" intuitions. But we find that in fact the model outperforms these ratings in predicting responses to novel images, indicating that the model "knows" something beyond the graded version of the word model. We now try to more clearly motivate this analysis with an explicit question in the updated version:
"But how much better are these models than the previous descriptive characterizations used for these regions? Are the models simply capturing a graded version of category selectivity, much like human intuitions of graded category membership 34,35 ?"

L238-L261, the fun experiment continues, but basically the same question: what would it mean if experts were better, or equal, at predicting the BOLD-response to these categories than a BOLD-predicting model? For both the novice and expert comparisons I get that it showcases the superior performance of the model, but does it tell us anything beyond this?

This analysis (like the previous one) is a way of asking what, if anything, the models tell us that we did not already know. The analysis above says the models have more information than the graded version of the previously used "word models" of category selectivity. Here we ask whether the models capture anything beyond the expertise of experts in the field. The superior performance of the models compared to experts indicates that models 'know' something more about these regions than the experts do; that is, they are not only providing image-computable versions of field expertise, but may actually be capturing something new. We use this finding as motivation to distill this additional knowledge within these models in subsequent analyses in the paper. Based on the reviewer feedback, we have now changed the introduction to this section to better motivate this question:
"The previous analysis demonstrates that the models do not merely provide image-computable versions of the word-level descriptions of the responses of each region, but embody further information not entailed in those "word models". Here we ask whether the models "know" more about these regions than even experts in the field do.
Minor remarks L125-L130, the vast majority of networks tested are feedforward networks, with resnet-50 being a strong 'winner'. Given the little difference between the top performing model this is not a big point but there is an emphasis on the Cornet's in these lines that are not justified by the tests performed, or the results discussed.

The reviewer is correct that differences between the top performing models are small. But the base models also vary drastically in architecture, the specific images used to train them (training diet), the specific task on which they were trained, etc. So in this section we picked out comparisons that were more balanced and therefore justified (only within the specific comparisons). These include a discussion of deep vs. shallow models, randomly initialized vs. trained models (architecture unchanged), models trained on specific images (like faces) vs. broad datasets (architecture unchanged), and recurrent vs. non-recurrent models (base architecture unchanged). It is only in this context that we report the comparison within the CORnet classes (specifically CORnet-Z vs. CORnet-R and -S), because the CORnet models are a class of shallow models that differ in just one of these dimensions at a time (in this case, varying recurrence while fixing the training diet, optimization goal, and overall architecture).
L301-L303, how far do you need to go down the list before the first black swan appears (as a percentage of all images tested), and what happens beyond this: a gradual transition away from face-like images, or a more abrupt one? (And fair enough, this is difficult to characterise.)

L492-L494, all subjects have robust L and R EBA, FFA and PPAs. Were they selected for this, and/or is this guaranteed to happen with enough localisers?

We did not perform any subject pre-selection. In our experience, 4-5 runs of a good (face-object-scene-body) localizer (like the dynamic localizer we used in our study) are usually sufficient to robustly localize these fROIs in subjects.
Fig 2. The title is "models", but only the ResNet-50 results are shown. I believe the rest of the top-10 networks would be very similar, but these are not shown. Is it not better to just refer to it as resnet-imagenet-50 (or an abbreviation of this)?

Reviewer #2 (Remarks to the Author):
Computational models of category-selective brain regions enable high-throughput tests of selectivity

This paper measured the response properties of category-selective regions with a new high-quality fMRI dataset, and then accurately modeled these responses using a deep convolutional neural network encoding-model approach. These image-computable models accounted for overall response profiles better than current intuitive models of what these regions do (with several other thoughtful tests into the different facets of responses of these regions). These models then allowed the authors to conduct a high-throughput test of the category-selectivity claim, finding no rogue non-category images in a set of 3.5M that would be likely to yield a high response in these regions. And they probed the nature of the features encoded in these regions, leveraging some of the newest advanced methods available.
The paper is clearly written; the scope of the question at hand is nicely targeted, while the methods applied to this question are broad, ranging from measures of everyday intuitions to sophisticated, cutting-edge techniques like GANs for image synthesis; the analyses are sound; the patterns of results are strong; and the claims are clear, important, and well-justified.
I also think there are a rather substantial number of patterns of data that could be further scrutinized (!) -- far beyond a single paper's worth of ideas here. As such, I am not going to tug on the theoretical threads of these datasets that I think could use more exploration (likely idiosyncratic to me), and instead will focus solely on the claims directly presented. To this end, I have only minor comments. (Relatedly, it is great that the authors are providing the data to the community for further exploration.)

Thank you for your encouraging feedback and comments, which have made the paper much clearer and stronger. We too are excited to see the impact these models and data make on the community.
How good are these models? My main comment is that I was left by the end without a clear sense of whether these encoding models have completely nailed the responses, and of how much representational room there is to grow, as measured in this fMRI dataset. For example, figures 2a-c are rather outstanding; I think this is pretty close to perfect. But in figure 2f it seems like there may be pretty far to go for some categories in some regions, especially when in figure 3c the same data (I think?) are presented, but this time with the noise ceilings, so it's easier to tell what is measurement noise vs. uncaptured structure. Is there important information to be gained from the fact that, e.g., the within-face predictivity is much higher than the within-body predictivity in FFA? I think the main take-home about the success of the encoding models for these regions could perhaps be clarified with some sort of synthesized summary of what they get right (which is a lot, obviously), but maybe also what your current analyses indicate they are not capturing, given the reliable structure of responses to these 185 images.

Our within-category prediction analyses indeed expose a larger gap between the model predictions and the observed responses. As suggested by the reviewer, we have now added a brief synthesized summary describing the successes and failures of the model. As the data suggest, there is a lot of across-category variance which the models (and humans) rightly pick up on, but there is also a lot of within-category variance that remains to be explained. This result also underscores the importance of collecting high-quality data, which reveals the reliable differences between individual image exemplars and is necessary to expose the variance in responses across these images. We now include a synthesized summary in the Results section (see excerpt below).
"Taken together, these results show that ANN models of the ventral stream can predict the response to images in the FFA, PPA, and the EBA with very high accuracy. Further analyses on a trained ResNet50 show that the predictive accuracy of the models remains high even when tested on individual stimuli within categories and even across participants. Testing predictions within categories also exposed a larger gap between the model predictions and responses, which indicates room for further model-development efforts."

Generalizing from one subject to another subject. When reading these results I wanted a little more context: how well is the encoding model transferring from one brain to another, given how well that brain directly correlates to the other brain? For example, is it at the noise ceiling of the data, or is there reliable structure measured in the FFA between subjects that the encoding model isn't capturing? Right now I cannot tell apart how much these analyses are telling me about the similarity and differences of people's FFAs from the effectiveness of the encoding model itself.

The reviewer raises a fair point. We now provide the mean subject-subject agreement in Figure S4 (also shown below). This analysis shows that there is some small but reliable structure between the subjects' FFAs that the models are currently not capturing. This figure is now also included within the new Supplemental Figure 3.
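One way to compute such subject-to-subject agreement is a leave-one-subject-out correlation. The sketch below uses simulated response profiles; the shared-plus-idiosyncratic structure is invented for illustration, not taken from the paper's data:

```python
import numpy as np

# Leave-one-subject-out agreement: correlate each subject's fROI response
# profile with the mean profile of the remaining subjects.
rng = np.random.default_rng(5)
n_subjects, n_images = 4, 185
shared = rng.standard_normal(n_images)          # structure common to all subjects
profiles = shared[None, :] + 1.0 * rng.standard_normal((n_subjects, n_images))

agreement = []
for s in range(n_subjects):
    others = np.delete(profiles, s, axis=0).mean(axis=0)
    agreement.append(np.corrcoef(profiles[s], others)[0, 1])

mean_agreement = float(np.mean(agreement))
print(round(mean_agreement, 3))
```

Comparing a cross-subject encoding model's accuracy to this agreement value is what lets one tell apart model shortcomings from genuine between-subject differences.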
Model comparison. One of the first steps was the model screening procedure, with a large number of models compared. There are some interesting models in there like "curvature" and "room-layout" that don't seem like deep net models (but maybe they are?). In any case, I wish there was a little more information/meta-data about each of these models.

We now provide this information about each of the models in Table 1 (accompanying Fig. S5).

Fig 3b: it is unclear exactly what is being compared in the stats reported (e.g., are you reporting data pooled across hemispheres here? pg 9 ln 252).

The statistics were performed by comparing the model predictions with the expert/novice predictions individually on each fROI and hemisphere separately. We now specify this explicitly in the text where the statistics are reported.
Typo: "Linear combing" pg 13 line 360

Thank you! Fixed now.

Reviewer #3 (Remarks to the Author):
This study evaluates the fit between multiple (60) deep neural network models and human hemodynamic responses to natural images in three category-selective visual areas: FFA, PPA & EBA. Four human subjects underwent extensive fMRI scanning, measuring at least 20 repetitions of each of 185 natural images. The authors employed a well-established linear encoding approach: Particular neural network layers were evaluated as potential linear bases for predicting the hemodynamic responses to held-out natural images. The prediction accuracy (i.e., the correlation between predicted and observed responses) was assessed both at the regional average activity level and at the voxel activity level. The generality of the model's predictions was tested within and across subjects. Importantly, the authors demonstrated that the best performing model explains a large portion of the variability of the responses *within* object categories, going beyond plain category-based prediction. The authors compared the accuracy of the best performing model to the judgments of human novices with respect to the category-typicality of the images, as well as to expert predictions of the potential of different images to activate the regions. Last, the authors conducted several in silico experiments using the fitted encoding models. They searched for activating images across large natural image sets, synthesized activating stimuli using a GAN-parameterized activation maximization procedure, and used a masking algorithm to highlight activation-driving regions in the images. The in silico experiments all indicated that the maximally activating stimuli for these regions are respective category-members (e.g., faces for FFA), affirming the conventional view of the category-selectivity of these regions.
Overall, this is an important contribution to the computational modeling of high-level visual responses by deep-neural-network-based models. The study reflects a considerable body of work, with multiple models and well-motivated analyses and controls. Having said this, I have several comments about the manuscript as it is, but I believe that these issues are addressable by some additional work.

Thank you for your encouraging comments and constructive feedback which have substantially improved the paper. We have now performed several additional analyses and provide important clarifications which we hope address the remaining concerns.
Major points 1) Unaccounted-for selection bias. The authors evaluated 60 neural networks and then focused on the result from the best performing one (Resnet-50). While 10-fold cross-validation was employed to eliminate the bias introduced when fitting linear weights to map each model's activations to the brain responses, it is unclear from the methods section whether any measures were applied in order to eliminate the selection bias imposed by picking the best model among the 60 candidates. The concern is that this selection bias might cause exaggerated estimates of the prediction accuracy. Importantly, such a bias might invalidate the comparison between the best model and the human raters. Unlike the models, the accuracy of the human raters (novices and experts alike) was not subject to any maximum-taking operation, and hence the comparison might be unfair. In other words, the manuscript currently compares the best model to the average human rater, and it does so without correcting the bias incurred due to the "winner's curse".
A second source from which selection bias can creep in is the selection of the best layer in each DNN. Here as well, it is unclear whether the bias introduced by this selection operation was accounted for. To allow for a fair assessment of the models, this bias has to be mitigated by a proper *nested* cross-validation procedure, in which the most inner fold is used to fit the encoding parameters, the intermediate fold to select the best layer (and model), and the outer fold to obtain unbiased prediction accuracy estimates. An alternative, even stronger approach is a completely held-out test dataset. Failing to do so exaggerates the reported accuracy estimate and might bias the model comparison in favor of models with more numerous layers.
We thank the reviewer for raising these important points, which we address as follows:

To summarize, we agree with the reviewer that our estimates of model prediction accuracy may be inflated, even when tested across participants and regions as we have done here. It is also for this reason that we feel that the model arbitration process should be done on completely new data over many regions and benchmarks. But our models at the fROI level will make this question more easily testable by independent groups (compared to voxel-wise models). We now address this point explicitly in the discussion of the paper as well.
"Although ResNet50-V1 was the numerically most accurate base-model across regions (consistent with 31) based on a very broad screen, it is important to note that a single study or the small number of regions considered (as in our own work) is insufficient to definitively determine which single base-model is the most brain-like. Ultimately the model arbitration will require a community-wide effort and rigorous model benchmarking on a larger data set with new stimuli, subjects, and brain regions. An important contribution of our work is the fROI scale of computational modeling, which makes it possible to evaluate our exact model on completely independent subjects, hypotheses, and data. fROIs like the FFA, PPA, and EBA can be isolated in almost all participants, and our models make testable predictions and are more directly falsifiable than, say, voxel-wise models (even though, as we show, these exact metrics are importantly correlated across the scales of computational modeling)."

2) The image set considered in the in silico image screening is very large, but it is limited to a narrow domain: natural images. This constraint means that non-natural images are never considered by this procedure. Even if one of the regions was highly responsive to non-natural, non-category-member images (for a concrete example, consider "contextually-defined faces", Cox, Meyers & Sinha, 2004 Science), this would never be discovered by the screening procedure since it tested only natural images. In principle, model-driven image synthesis (similar to that applied by Bashivan, Kar & DiCarlo, 2019 Science) could have addressed this point. However, since the current work limits the synthesized images to those produced by a strong GAN trained on a natural image corpus, the stimulus-synthesis procedure's capacity to form novel, non-object-like activating images is severely limited.
I suspect that if the authors had used a less restricted activation maximization procedure, less semantically-sensible activating images would have arisen, potentially revealing incorrect (i.e., adversarial) model predictions. This point touches upon a whole can of worms associated with deep neural networks as vision models (adversarial examples, metamers, and so on). The paper's discussion in its current form does not mention this obvious shortcoming of the models.
To summarize this point, the interpretation of the image screening as affirming the category-selective nature of the regions is not completely substantiated, due to the limited domain of the screening images (as well as the lack of empirical testing; see the next point). The control analysis on the "face-selective" AlexNet units is necessary but not sufficient, since it addresses the case of responsiveness to non-category-member natural images but not the case of responsiveness to non-natural images. Furthermore, announcing that the current approach is "subjecting claims of category selectivity to their strongest tests to date" (line 21 in the abstract, as well as lines 266-267) neglects the extremely rich literature on modified, jumbled, or otherwise non-natural 'trick' stimuli that reveal non-trivial response properties of these high-level regions.

The reviewer correctly highlights a limitation of our study, which is that we only consider naturalistic stimuli (previously discussed in Lines 440-460 and made more explicit now, see below). Our claim of "subjecting category-selectivity to strongest tests yet" is with respect to the number of stimuli considered (millions in our case via our high-throughput screening procedure, and perhaps many more with the GAN, compared to at best dozens in prior studies). We are currently exploring a wider space of non-natural images, but that is a much less constrained space and will require extensive closed-loop testing, which we hope to write up in a future paper.
Based on the reviewer's suggestion, we are now making the following changes: First, we now state our claim more precisely as "subjecting category-selectivity to the strongest tests yet on naturalistic stimuli". (Note that these changes have been made in the abstract, results, and discussion accordingly.)

Next, we expand on this limitation further in the Discussion:
"One route will make use of new models now being developed that include known properties of biological networks 42-45 and that may better fit neural responses. Another route will learn from model failures in an effort to improve predictive accuracy. For example, the ANNs used here were trained on naturalistic images, are vulnerable to targeted adversarial attacks 46,47, and are known to have limited generalization to out-of-domain sample distributions (for instance, they do not generalize from natural images to line drawings). These shortcomings suggest that these models are not likely to accurately predict observed fMRI responses to more abstract 3,48,49 and symbolic stimuli 50-53, including contextually-defined faces 54,55. Thus an important avenue of future work will entail exposing and expanding the bounds within which these models retain their predictive power. An effective strategy could be to run ever more targeted experiments that push model development efforts, either by including a more diverse set of stimuli for training (such as sketches, line drawings, and cartoons), or by building in stronger inductive biases, as in biological networks."

Screening on sketches instead of naturalistic images. Here we performed our high-throughput screening procedure on stimuli from ImageNet-Sketch by Wang et al., NeurIPS 2019 (downloaded from https://github.com/HaohanWang/ImageNet-Sketch). This database contains 50,000 images (50 images for every ImageNet class). Note that these images are much more impoverished than the full naturalistic stimuli used in the manuscript and have very different overall image statistics. Below are the top 50 images predicted to maximally activate the FFA, PPA, and the EBA (note that the image duplications and mirror flips occur only because those images are repeated in the image set). These preliminary results indicate that the models still pick out images that are members of the hypothesized preferred category for these regions. We are currently in the process of expanding the stimulus domains even further. But note that these are exciting predictions that will benefit from closed-loop tests, which we plan to undertake in a subsequent study on this topic.
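As a concrete illustration, the screening step itself reduces to ranking every candidate image by the encoding model's predicted response and keeping the top k. The sketch below is a minimal, self-contained stand-in (random features and a hypothetical linear readout, not our actual model features or fitted weights):

```python
import numpy as np

def screen_images(features, readout_weights, readout_bias, top_k=50):
    """Rank candidate images by the encoding model's predicted fROI response.

    `features` is an (n_images, n_features) array of ANN activations for the
    candidate images; the linear readout maps features to a predicted regional
    response. All names here are illustrative, not the actual fitted model.
    """
    predicted = features @ readout_weights + readout_bias   # (n_images,)
    order = np.argsort(predicted)[::-1]                     # descending
    return order[:top_k], predicted[order[:top_k]]

# Toy demonstration: a "model" that simply prefers a large first feature.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 8))
w = np.zeros(8)
w[0] = 1.0
top_idx, top_scores = screen_images(feats, w, 0.0, top_k=5)
```

The same ranking applies unchanged whether the candidate pool is natural photographs or sketches; only the feature extraction step differs.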
3) The distinctions between model predictions and empirical findings are somewhat blurred. The in silico experiments conducted on the fitted models are informative and contribute to the literature. In particular, I find Figure 4b particularly novel. However, all of the results from line 269 and onwards are model predictions rather than empirical neuroscientific findings. The authors did not call the subjects back to the scanner to test whether the natural and synthetic maximally activating stimuli predicted by the model are indeed maximally effective as stimuli for human cortices. While the authors explicitly acknowledge the distinction between predictions and results in some places in the paper (e.g., lines 377-378), this distinction is glossed over elsewhere. For example, in lines 311-312, the authors state: "This finding further strengthens the inferences that these regions are indeed selective for faces and places." This deduction does not rely on actual empirical testing.
Based on the logic of the study, we first built highly predictive encoding models of the category-selective fROIs. We then used these models to turbo-charge our search for 'outliers', but we did not find any. Importantly, the models were in principle capable of finding such outliers (see the AlexNet control). Had we found images that the models predicted would activate the FFA more than faces, we would have tested them, but we did not.

To clarify, the goal of the current paper was not to demonstrate non-invasive control (for which we would have called subjects back to the scanner to check for maximum activation) but to use the models to answer the specific scientific question of category-selectivity. Despite using the high-throughput screening and synthesis methods, our analyses did not reveal any violation of the prior hypothesis to test experimentally. We now clarify this logic explicitly in the text.
"Had we found any stimuli predicted to produce a strong response in a region that were not members of the hypothesized preferred category for that region, we would have cycled back to scan participants viewing those images to see if indeed they produced the predicted high responses. But we did not find such images, so there were no potentially hypothesis-falsifying stimuli to scan."

4) The description of the data analysis procedure is insufficient for evaluation and replication. The authors provide some external references, but these are insufficient for understanding what exactly was done. In particular, I'm concerned with the lack of description of how the best layer within each model was selected (was there nested cross-validation as suggested above?), missing details of the linear fitting (what ridge parameters were considered and how were they selected?) and absent information about the "two-stage linear mapping function". For the latter, the authors cite Bashivan, Kar & DiCarlo 2019, but it is unclear whether the implementation described there, applied to modeling particular neurons rather than voxels, is identical to what was done here.

5) Can the authors provide any form of statistical inference of the differences between the various models?
The prediction accuracy (Fig. S3) seems quite similar for the top ten models. Relevant sources of variability that contribute to the ranking uncertainty are the finite sampling of stimuli, the finite (and small) sampling of subjects, and the finite number of repetitions.

We agree with the reviewer that model arbitration is an important exercise, which we think will take the effort of the entire community across several experiments and benchmarking over several metrics (instead of using predictive accuracy as the only benchmark). (We refer the reviewer to our papers on Brain-Score and integrative benchmarking for our stand on these issues.)
We now expand on this in the Discussion as well: "Although ResNet50-V1 provided the numerically most accurate models across regions (consistent with 31) based on a very broad screen of models, it is important to note that a single study or small number of regions considered (as in our own work) is insufficient to prescribe a single base-model as the most brain-like. Ultimately, model arbitration will require a community-wide effort and rigorous integrative benchmarking on completely independent data from new subjects and ever more regions (similar to, say, our Brain-Score platform 14,26 for non-human primate data). But an important contribution of our work is the fROI-scale of computational modeling, which makes it possible to evaluate our exact model on completely independent subjects, hypotheses, and data. fROIs like the FFA, PPA, and EBA can be isolated in almost all participants. Our models make testable predictions and are thus more directly falsifiable than, say, voxel-wise models (even though, as we importantly show, these exact metrics are correlated)."

6) Reliability estimates/"noise ceilings" should be plotted for each figure that depicts prediction accuracy. The way these estimates were computed should be explicitly stated in the methods section.

Based on the reviewer's comments, we have added the noise ceilings to all figures that show predictive-accuracy measures. These include Figure 2 itself (instead of showing them only in Figure 3) and the Supplemental Figures. A description of how these were computed is also included in the Methods section.
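For readers who want the gist of such a reliability estimate, a generic split-half noise-ceiling computation can be sketched as follows (the paper's exact procedure is in the Methods; the function below is an illustrative stand-in, not our code):

```python
import numpy as np

def split_half_noise_ceiling(responses, n_splits=100, seed=0):
    """Split-half reliability of image-evoked responses (generic sketch).

    `responses` is (n_repetitions, n_images). Repetitions are repeatedly split
    into two halves, the half-averaged response profiles are correlated, and
    the Spearman-Brown correction estimates full-data reliability.
    """
    rng = np.random.default_rng(seed)
    n_rep = responses.shape[0]
    estimates = []
    for _ in range(n_splits):
        perm = rng.permutation(n_rep)
        half_a = responses[perm[: n_rep // 2]].mean(axis=0)
        half_b = responses[perm[n_rep // 2 :]].mean(axis=0)
        r = np.corrcoef(half_a, half_b)[0, 1]
        estimates.append(2 * r / (1 + r))   # Spearman-Brown correction
    return float(np.mean(estimates))

# Toy check: 20 repetitions of 185 "images" with modest measurement noise.
rng = np.random.default_rng(1)
signal = rng.normal(size=185)
resp = signal[None, :] + 0.1 * rng.normal(size=(20, 185))
nc = split_half_noise_ceiling(resp)
```

The resulting ceiling bounds the prediction accuracy any model could reach given the measurement noise in the data.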
7) The manuscript somewhat unfairly downplays previous works on encoding visual responses by deep neural network models. In particular, Wen, Shi, Chen & Liu (2018, Scientific Reports, reference 12 in the manuscript) have used the same Resnet-50 indicated as the best model here and performed quite extensive in silico experiments on category-selectivity. It would serve the readers better to situate the current work in the context of the existing literature. Another highly relevant work (not currently cited) is Eickenberg, Gramfort, Varoquaux, and Thirion, 2017 NeuroImage.

The Eickenberg reference was a glaring oversight which we have now included in the paper. We thank the reviewer for pointing it out. We also now emphasize the contribution and converging evidence from the Wen et al. (2018) paper, both in the Results and the Discussion.
8) Given the well-established infrastructure for sharing fMRI data (the Openneuro platform and the BIDS format) and the high potential utility of the data collected by the current study, the proposed data sharing policy ("upon request") seems to be suboptimal.

The intended goal of all of our work is to compile high-quality experimental data as well as computational models and make them accessible to the community. To clarify, we will be making the models of the FFA, PPA, and EBA, as well as the data used to build them, publicly available on publication.
Minor points:

9) The number of participants (four) has to be mentioned in the first Results paragraph.
We have now changed the very first sentence of the Results section accordingly: "We scanned four participants with fMRI to first localize the FFA, PPA, and EBA…"

10) Many data points in Figure 3c have no error bars.

The error bars in Figure 3c are present; in some cases they are simply too small and occluded by the marker (circle).

11) Were the images in Figures 3c and 3e randomly sampled, best performing, or manually chosen?

Images in Figure 3c are the exact top images for the FFA and the PPA. For the EBA, we did have to remove 2 NSFW (not safe for work) images (but the 5 images included for the EBA are otherwise among the top 6 observed). Images in Figure 3e were chosen randomly from the 10 we synthesized for each region.

12) The image synthesis section does not include necessary details for replication, such as optimizer choice, optimization hyper-parameters, stopping conditions, and GAN latent initialization.

Please find the updated Methods section which describes all the specifics of the GAN synthesis procedure.
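To convey the core idea, activation maximization through a generator can be sketched in a few lines. The snippet below substitutes a tanh of a linear map for the GAN generator and a linear readout for the encoding model, and uses a hand-derived gradient in place of autodiff; all names and settings are illustrative only, not the actual procedure in the Methods:

```python
import numpy as np

def synthesize_max_activating(G, w, n_steps=200, lr=0.05, seed=0):
    """Gradient-ascent activation maximization through a generator (sketch).

    A latent code z is mapped to an "image" by tanh(G @ z), a stand-in for a
    GAN generator, and scored by a linear encoding-model readout w. We ascend
    the gradient of the predicted response with respect to z. A real pipeline
    would use a pretrained GAN and autodiff; everything here is illustrative.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=G.shape[1])               # latent initialization
    for _ in range(n_steps):
        img = np.tanh(G @ z)                      # generated "image"
        grad_z = G.T @ (w * (1.0 - img ** 2))     # d(w . img)/dz, chain rule
        z += lr * grad_z                          # fixed-step gradient ascent
    return z, np.tanh(G @ z)

# Toy generator and readout; compare optimized vs. initial predicted response.
rng = np.random.default_rng(2)
G = rng.normal(size=(16, 4)) / 4.0
w = rng.normal(size=16)
z_init = np.random.default_rng(0).normal(size=4)  # same init as in the function
z_opt, img_opt = synthesize_max_activating(G, w)
```

Optimizing in the GAN's latent space, rather than in raw pixel space, is what keeps the synthesized images on the natural-image manifold.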
REVIEWER COMMENTS

Reviewer #1 (Remarks to the Author):

The authors have addressed the comments I had fully. I am looking forward to the next papers in this research line. Exciting new avenue.

Steven Scholte
Reviewer #2 (Remarks to the Author): The revision has addressed my comments.
Reviewer #3 (Remarks to the Author): The authors did considerable work improving the manuscript. In particular, I appreciate the inclusion of the noise ceiling estimates.
I am still uncertain about selection bias and cross-validation. The authors report in the response letter that "predictivity scores are based on cross-validation on held-out stimuli after the layer is chosen." This is consistent with two distinct procedures: (I) The 5-fold cross-validation used for layer choice (line 623) and the 10-fold cross-validation used for the convolutional mapping do not use any of the "held-out 10% stimuli" (line 156). Such nesting of the model-fitting cross-validation within the model-evaluation cross-validation is not explicitly described in the paper.
(II) Alternatively, there is only a single level of cross-validation. Training accuracy is used for picking the best layer and for fitting the convolutional mapping. For each cross-validation fold, the unbiased accuracy of the fitted model is evaluated on the fold's test data. These accuracy estimates are then averaged across folds, pooling together evaluations of potentially distinct layer choices and convolutional mappings.
Was one of these two procedures employed? Should the reader treat all of the reported accuracies across the paper as unbiased estimates? Confusingly, the notion of holding out stimuli is introduced only in line 156 rather than at the very beginning of the methods section, suggesting that not all of the accuracy estimates were based on predicting fully held-out data.
The answers to these questions should be evident from the manuscript itself. In particular, the data handling procedure should be described in the Methods section in a level of detail that enables precise reproduction. If some of the accuracy estimates are potentially biased (i.e., they result from hyper-parameter sweeps without external validation), this caveat should be transparently reported and discussed.
As mentioned in the previous review, this concern relates not only to the numerical estimates themselves but also to the validity of the comparison between model and human predictions (Figure 3).
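For concreteness, procedure (I), with its hyper-parameter selection nested inside the evaluation loop, can be sketched generically as follows (closed-form ridge and synthetic data stand in for the layer choice and fMRI responses; this is an illustration, not the authors' pipeline):

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression weights (intercept omitted for brevity)."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_feat), X.T @ y)

def nested_cv_score(X, y, alphas, outer_k=10, inner_k=5, seed=0):
    """Procedure (I): hyper-parameter choice nested inside evaluation.

    The inner 5-fold loop selects the hyper-parameter (standing in for the
    layer choice / ridge penalty) using only the outer training fold; the
    outer 10-fold loop scores that choice on data never touched during
    selection, so the averaged score is an unbiased accuracy estimate.
    """
    rng = np.random.default_rng(seed)
    outer_folds = np.array_split(rng.permutation(len(y)), outer_k)
    scores = []
    for i in range(outer_k):
        test = outer_folds[i]
        train = np.concatenate([f for j, f in enumerate(outer_folds) if j != i])
        # Inner loop: pick alpha using the outer-training data only.
        inner_folds = np.array_split(train, inner_k)
        best_alpha, best_r = alphas[0], -np.inf
        for alpha in alphas:
            rs = []
            for k in range(inner_k):
                val = inner_folds[k]
                fit = np.concatenate(
                    [f for m, f in enumerate(inner_folds) if m != k])
                wgt = ridge_fit(X[fit], y[fit], alpha)
                rs.append(np.corrcoef(X[val] @ wgt, y[val])[0, 1])
            if np.mean(rs) > best_r:
                best_alpha, best_r = alpha, float(np.mean(rs))
        # Outer evaluation on fully held-out data.
        wgt = ridge_fit(X[train], y[train], best_alpha)
        scores.append(np.corrcoef(X[test] @ wgt, y[test])[0, 1])
    return float(np.mean(scores))

# Synthetic check: a linear ground truth should be predicted well.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=200)
score = nested_cv_score(X, y, alphas=[0.1, 1.0, 10.0])
```

Procedure (II) differs only in that the inner loop is removed and the hyper-parameter is chosen from training accuracy within each outer fold; the manuscript should state explicitly which of the two was used.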