Predicting eye movement patterns from fMRI responses to natural scenes

Eye tracking has long been used to measure overt spatial attention, and computational models of spatial attention reliably predict eye movements to natural images. However, researchers lack techniques to noninvasively access spatial representations in the human brain that guide eye movements. Here, we use functional magnetic resonance imaging (fMRI) to predict eye movement patterns from reconstructed spatial representations evoked by natural scenes. First, we reconstruct fixation maps to directly predict eye movement patterns from fMRI activity. Next, we use a model-based decoding pipeline that aligns fMRI activity to deep convolutional neural network activity to reconstruct spatial priority maps and predict eye movements in a zero-shot fashion. We predict human eye movement patterns from fMRI responses to natural scenes, provide evidence that visual representations of scenes and objects map onto neural representations that predict eye movements, and find a novel three-way link between brain activity, deep neural network models, and behavior.

Here, we present two analyses more consistent with the mainstream spatial attention modeling literature. The first analysis shows that explicitly re-weighting CNN activity to predict eye movements improves prediction performance for decoding models derived from face-identification and random CNNs, but not scene- and object-categorization CNNs. The second analysis shows that model-based fMRI prediction performance can be improved using an empirically defined baseline fixation distribution to correct for center-bias.
Re-weighting CNN activity to explicitly predict fixation patterns.
Here, we aim to re-weight CNN activity to improve prediction performance in the spatial attention models and model-based reconstructions. Such an approach tests whether features from a given CNN type map onto spatial representations in the brain that predict eye movements if an additional learning step explicitly linking CNN activity to eye movements is included.
To learn weights across CNN channels that improve eye movement prediction, we predicted fixation map values from CNN unit activity drawn from the five pooling layers (1,472 channels total) using support-vector regression with a ridge penalty on data from the MIT Eye Movement Dataset¹. Activity maps for each CNN channel were linearly interpolated to the spatial resolution of pool1 (112 x 112 px), and all resized units within a layer were normalized to have zero mean and unit standard deviation. For each of the 1,000 training images, this produced a [1472 x 112 x 112] matrix of CNN activity. Fixation maps were calculated for each of the training images at the group level (n = 15) and smoothed using a 2D Gaussian kernel (SD = 20 px, matched to the cross-validated smoothing kernel from our validation datasets). For each training image, we randomly sampled 100 image locations to build the data matrix for the regression. For a given sampled location, the fixation map value became a new entry of Y and the CNN activity values across all 1,472 channels became a new row of X, leading to a final Y vector of [100,000 x 1] and an X matrix of [100,000 x 1472]. The regression outputs a [1 x 1472] vector of beta weights that can be multiplied by a [1472 x 12544] matrix of CNN activity for a given image (112 x 112 = 12,544 locations) to re-weight the activity to better predict eye movements.
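The sampling and re-weighting steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the pipeline used in the study: a closed-form ridge regression stands in for the support-vector regression with a ridge penalty, locations are sampled without replacement, no intercept is fitted (the channels are assumed to be normalized already), and the array sizes are toy values rather than 1,000 images x 1,472 channels x 112 x 112 px. The function names `build_design_matrix`, `fit_ridge`, and `reweight` are hypothetical.

```python
import numpy as np

def build_design_matrix(cnn_maps, fixation_maps, n_samples=100, seed=0):
    """Sample pixel locations per image and stack them into X and y.

    cnn_maps:      (n_images, n_channels, H, W) normalized CNN activity
    fixation_maps: (n_images, H, W) smoothed group-level fixation maps
    """
    rng = np.random.default_rng(seed)
    n_images, n_channels, H, W = cnn_maps.shape
    X = np.empty((n_images * n_samples, n_channels))
    y = np.empty(n_images * n_samples)
    for i in range(n_images):
        # Assumption: locations are drawn without replacement within an image.
        idx = rng.choice(H * W, size=n_samples, replace=False)
        rows = slice(i * n_samples, (i + 1) * n_samples)
        # One row of X per sampled location, one column per channel.
        X[rows] = cnn_maps[i].reshape(n_channels, -1)[:, idx].T
        y[rows] = fixation_maps[i].reshape(-1)[idx]
    return X, y

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge solution (a stand-in for the SVR used in the text)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def reweight(cnn_map, beta):
    """Apply channel weights: [1 x C] times [C x (H*W)], reshaped to (H, W)."""
    n_channels, H, W = cnn_map.shape
    return (beta @ cnn_map.reshape(n_channels, -1)).reshape(H, W)

# Toy data: fixation maps are an exact weighted sum of channels, so the
# regression should recover the weights (real data would be far noisier).
rng = np.random.default_rng(1)
cnn_maps = rng.standard_normal((20, 8, 16, 16))
true_beta = rng.standard_normal(8)
fixation_maps = np.einsum("c,nchw->nhw", true_beta, cnn_maps)

X, y = build_design_matrix(cnn_maps, fixation_maps, n_samples=50)
beta = fit_ridge(X, y, alpha=1e-6)
weighted_map = reweight(cnn_maps[0], beta)
```

With the dimensions reported above, `X` would be [100,000 x 1472] and `beta` would have 1,472 entries; the averaging model corresponds to using equal weights for every channel instead of `beta`.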
Using the learned weighting to calculate computational spatial priority maps from CNN activity improved prediction performance for all CNN types (Supplementary Fig. 1a). Modest improvements were seen for the scene- and object-categorization CNNs, and markedly greater improvements were seen for the face-identification and random CNNs.
Next, we show results for predicting eye movements using model-based reconstructions for each CNN type that either averaged across channels or computed a weighted sum across channels. NSS scores for all analysis types and ROIs are shown in Supplementary Fig. 1b & 1d, and NSS difference scores (average model minus weighted model) are shown in Supplementary Fig. 1c & 1e. Significance markers for the difference scores in Supplementary Fig. 1 represent the main effect of model type (average vs. weighted) in a two-way repeated-measures ANOVA with model type and ROI as factors. For decoding models using scene CNNs (Supplementary Fig. 1c & 1e, first column), performance was equivalent for the average and weighted approaches in the within-individual and internal validations. For external validation, the average model outperformed the weighted model for base reconstructions, and equivalent performance was seen for smoothed and center-bias corrected reconstructions. For base and smoothed/center-bias corrected reconstructions from object CNNs (Supplementary Fig. 1c & 1e, second column), performance was equivalent for within-individual validation, and the average model outperformed the weighted model for internal and external validation. For base and smoothed/center-bias corrected reconstructions from face and random CNNs (Supplementary Fig. 1c & 1e, third and fourth columns), performance was equivalent for within-individual validation, but the weighted model outperformed the average model for internal and external validation.
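For readers unfamiliar with the metric, the NSS (normalized scanpath saliency) scores reported here follow the standard definition: the predicted map is z-scored and then averaged at the fixated locations, so chance performance is 0 and higher values mean fixations landed on higher-priority locations. The sketch below illustrates that standard definition only; the function name `nss`, the (row, col) fixation format, and the toy map are assumptions, and the exact implementation used for the analyses is not specified in this section.

```python
import numpy as np

def nss(priority_map, fixations):
    """Normalized scanpath saliency.

    priority_map: (H, W) predicted spatial priority map
    fixations:    sequence of (row, col) integer pixel coordinates
    """
    sd = priority_map.std()
    if sd == 0:
        raise ValueError("NSS is undefined for a constant map")
    # z-score the whole map, then average the values at fixated pixels.
    z = (priority_map - priority_map.mean()) / sd
    rows, cols = np.asarray(fixations).T
    return z[rows, cols].mean()

# Toy example: a single hot spot at (4, 5) on an otherwise flat map.
toy_map = np.zeros((10, 10))
toy_map[4, 5] = 1.0

nss_peak = nss(toy_map, [(4, 5)])          # fixation on the hot spot
nss_off = nss(toy_map, [(0, 0), (9, 9)])   # fixations away from it
```

A difference score as used above would then be the NSS of the average model minus the NSS of the weighted model for the same image and participant set.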
Overall, these results show that features optimized for scene and object categorization best generalize off-the-shelf to characterize spatial representations in visual brain regions that predict eye movements. For scene and object categorization CNNs, re-weighting was not necessary to obtain respectable computational and brain-based predictions. The averaging model relies on the relative weighting of channels that was set through learning to complete the visual categorization task, and this weighting already captures spatial contingencies in scenes that are consistent with eye movement patterns, both behaviorally and in the brain. By default, the relative weighting among channels for the face and random CNNs does not capture spatial information relevant to predicting eye movements; an additional explicit learning step is necessary to achieve performance comparable to that of the scene and object CNNs off-the-shelf.

Empirical center-bias correction.
To correct for center-bias empirically, the center-model was defined as a baseline fixation distribution across all images except the target image in a separate set of participants. Calculation of these baseline distributions was cross-validated across datasets (internal and external validation). For example, the empirical baseline for Image A in the within-individual or internal validation analyses was defined as the average fixation density map for all other images and all participants in the external validation dataset. The empirical baseline for the same image in the external validation analyses was the average fixation density map for all other images and all participants in the internal validation dataset. Each empirical baseline was re-scaled to range from 0 to 1. To correct for center-bias in the reconstructions using these empirically derived baselines, we multiplied each baseline pointwise with the corresponding spatial priority map reconstruction after the reconstruction had been smoothed with a 2D Gaussian kernel. The procedure is the same as for the Gaussian center-bias correction used in our primary analyses.
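The leave-one-image-out baseline and the pointwise correction described above can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' code: the helper names are hypothetical, the density maps are assumed to be already averaged over participants of the other dataset, the Gaussian smoothing is a simple NumPy-only separable filter with zero padding at the edges, and the map size and smoothing SD are toy values.

```python
import numpy as np

def gaussian_kernel_1d(sd, truncate=3.0):
    """Normalized 1D Gaussian kernel truncated at +/- truncate * sd."""
    radius = int(truncate * sd + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sd) ** 2)
    return k / k.sum()

def smooth_2d(img, sd):
    """Separable Gaussian smoothing (zero-padded edges).

    Assumes the kernel is no longer than either image dimension.
    """
    k = gaussian_kernel_1d(sd)
    out = np.apply_along_axis(np.convolve, 1, img, k, mode="same")
    return np.apply_along_axis(np.convolve, 0, out, k, mode="same")

def rescale01(m):
    """Linearly re-scale a map to the range [0, 1]."""
    lo, hi = m.min(), m.max()
    return (m - lo) / (hi - lo)

def empirical_baseline(density_maps, target_idx):
    """Average density over all images except the target, re-scaled to [0, 1].

    density_maps: (n_images, H, W) fixation density maps from the *other*
    dataset (e.g. external validation data when correcting internal analyses).
    """
    keep = np.arange(density_maps.shape[0]) != target_idx
    return rescale01(density_maps[keep].mean(axis=0))

def correct_center_bias(reconstruction, baseline, sd):
    """Smooth the reconstruction first, then multiply by the baseline."""
    return smooth_2d(reconstruction, sd) * baseline

# Toy data standing in for real fixation densities and a reconstruction.
rng = np.random.default_rng(0)
density_maps = rng.random((5, 32, 32))
reconstruction = rng.random((32, 32))

baseline = empirical_baseline(density_maps, target_idx=0)
corrected = correct_center_bias(reconstruction, baseline, sd=2.0)
```

Swapping `baseline` for a 2D Gaussian centered on the image would give the Gaussian center-bias correction used in the primary analyses, since the two procedures differ only in the map that is multiplied in.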
Empirical correction for center-bias improved reconstruction prediction performance across the within-individual, internal validation, and external validation analyses (Supplementary Fig. 2). As for the model-based reconstructions presented in the manuscript, empirically center-bias corrected reconstructions from V1, V2, V3, and hV4 significantly predicted eye movement patterns in the within-individual, internal validation, and external validation analyses (all P < 0.001). Example reconstructions can be seen in Supplementary Fig. 3.

Supplementary Fig. 2. Empirical center-bias correction (c) improves prediction performance. Results for base reconstructions (a) and reconstructions center-bias corrected with a 2D Gaussian (b) are included for comparison. Error bars represent standard error of the mean across participants in the fMRI dataset (n = 11, Within-Individual and Internal Validation) and external validation dataset (n = 22, External Validation). Significance is defined using permutation testing. * P < 1 x 10^-2, ** P < 4.55 x 10^-3 (Bonferroni-corrected threshold), *** P < 1 x 10^-3.

Supplementary Fig. 3. Model-based reconstructions empirically corrected for center-bias (c). Base reconstructions (a) and reconstructions corrected for center-bias with a 2D Gaussian (b) are included for comparison.

Supplementary Fig. 4. Computational spatial attention model results for all CNN types, and results for all CNN types and benchmark measures sorted by prediction accuracy. a. Internal validation. b. External validation. Error bars represent standard error of the mean across all participants in the internal (n = 11) and external (n = 22) validation sets.