Attention modulates neural representation to render reconstructions according to subjective appearance

Stimulus images can be reconstructed from visual cortical activity. However, our perception of stimuli is shaped by both stimulus-induced and top-down processes, and it is unclear whether and how reconstructions reflect top-down aspects of perception. Here, we investigate the effect of attention on reconstructions using fMRI activity measured while subjects attend to one of two superimposed images. A state-of-the-art method is used for image reconstruction, in which brain activity is translated (decoded) into deep neural network (DNN) features of hierarchical layers and then into an image. Reconstructions resemble the attended rather than the unattended images. They can be modeled by superimposed images with biased contrasts, comparable to the appearance during attention. Attentional modulations are found in a broad range of hierarchical visual representations and mirror the brain–DNN correspondence. Our results demonstrate that top-down attention counters stimulus-induced responses, modulating neural representations to render reconstructions in accordance with subjective appearance.


Reporting for specific materials, systems and methods
We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response. Note that full information on the approval of the study protocol must also be provided in the manuscript.
The original sample size was chosen on the basis of previous fMRI studies with similar experimental designs (n = 5), and we then collected data from two additional subjects (Subjects 6 and 7) at the editor's request during revision. The first three subjects (Subjects 1-3) were the same as those in a previous study (Shen et al., 2019). For these subjects, we reused a subset of previously published data (data for the training session, originally referred to as the "training natural image session" of the "image presentation experiment"; available from https://openneuro.org/datasets/ds001506/versions/1.3.1), while newly collecting additional data (data for the test session). For the last four subjects (Subjects 4-7), we newly collected the whole dataset (data for both the training and test sessions).
No data were excluded.
Results from multiple subjects and trials can be considered replications of the analysis. Using fMRI data from the five initially tested subjects, the main findings were independently replicated in four subjects, with multiple successful trials for each subject. Furthermore, at the request of the editor and reviewers during revision, we collected data from two additional subjects and confirmed the replicability of the main findings in those new subjects.
The full fMRI dataset was collected from individual subjects (within-subject design), and no subject randomization was performed.
Blinding was not relevant, because no randomization was done.
Subjects were recruited for their ability to participate in multiple fMRI sessions, each lasting up to 2 hours. All subjects had considerable experience participating in fMRI experiments and were highly trained. These characteristics of the subjects may have contributed to the quality of the data.
The study protocol was approved by the Ethics Committee of ATR. All seven subjects participated in two types of experimental sessions: a training session and a test session. Data from each subject were collected over multiple scanning sessions. On each experimental day, one consecutive session was conducted for a maximum of 2 hours. Subjects were given adequate time to rest between runs (every 7-10 min) and were allowed to take a break or stop the experiment at any time. The training and test sessions consisted of 24 and 16 separate runs, respectively. Each run comprised 55 trials (7 min 58 s per run). Each trial was 8 s long, with no rest period between trials. Additional 32- and 6-s rest periods were added to the beginning and end of each run, respectively. The whole training session was repeated five times.
Functional images covering the entire brain were acquired.
We performed the MRI data preprocessing through the pipeline provided by FMRIPREP (version 1.2.1). For the functional data of each run, a BOLD reference image was first generated using a custom methodology of FMRIPREP. Using the generated BOLD reference, data were motion corrected using mcflirt from FSL (version 5.0.9) and then slice-time corrected using 3dTshift from AFNI (version 16.2.07). This was followed by co-registration to the corresponding T1w image using boundary-based registration implemented by bbregister from FreeSurfer (version 6.0.1). The co-registered BOLD time series were then resampled onto their original space (2 × 2 × 2 mm voxels) with antsApplyTransforms from ANTs (version 2.1.0) using Lanczos interpolation.
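For reference, a pipeline like the one above is typically launched through the fMRIPrep command-line interface. The sketch below only assembles a hypothetical invocation (the paths and participant label are placeholders, not values from the study, and option sets differ across fMRIPrep versions); it does not run the pipeline.

```python
# Assemble (but do not execute) a hypothetical fMRIPrep command line.
# All paths and the participant label are illustrative placeholders.
bids_dir = "/data/bids"          # BIDS-formatted input dataset
out_dir = "/data/derivatives"    # where preprocessed outputs are written

cmd = [
    "fmriprep", bids_dir, out_dir, "participant",
    "--participant-label", "01",  # process one subject at a time
]
print(" ".join(cmd))
```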
The data were not normalized.
A constant baseline, a linear trend, and six motion parameters were removed.
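This nuisance-removal step can be sketched as an ordinary least-squares regression against a design matrix containing an intercept, a linear trend, and the six motion regressors; the residuals are the cleaned data. The function and variable names below are illustrative, not the study's code.

```python
import numpy as np

def remove_nuisance(bold, motion):
    """Regress out a constant baseline, a linear trend, and six
    motion parameters from each voxel's time series.

    bold   : (n_volumes, n_voxels) BOLD signals for one run
    motion : (n_volumes, 6) motion parameters for the same run
    """
    n = bold.shape[0]
    # Design matrix: intercept, linear trend, six motion regressors
    X = np.column_stack([np.ones(n), np.linspace(-1.0, 1.0, n), motion])
    beta, *_ = np.linalg.lstsq(X, bold, rcond=None)
    return bold - X @ beta  # residuals = cleaned time series
```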
No volume censoring was applied.
The data samples were temporally shifted by 4 s (2 volumes) to compensate for hemodynamic delays and despiked to reduce extreme values (beyond ± 3 SD within each run). They were then averaged within each 8-s trial (training session, four volumes) or within the last 6-s period of each trial (test session, three volumes, corresponding to the second to fourth volumes of each trial).
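One way these steps could be implemented is sketched below, assuming a TR of 2 s and reading "despiked" as per-voxel clipping at ± 3 SD; the study's actual implementation may differ.

```python
import numpy as np

def prepare_samples(run, trial_onsets, test_session=False, sd=3.0):
    """Shift, despike, and trial-average one run's voxel time series.

    run          : (n_volumes, n_voxels) preprocessed BOLD data
    trial_onsets : volume indices (after the shift) where trials begin
    """
    # Shift by 2 volumes (4 s at an assumed TR of 2 s) to compensate
    # for the hemodynamic delay
    shifted = run[2:]
    # Despike: clip values beyond +/- sd SD per voxel within the run
    # (clipping is one plausible reading of "despiked")
    mu, sigma = shifted.mean(axis=0), shifted.std(axis=0)
    clipped = np.clip(shifted, mu - sd * sigma, mu + sd * sigma)
    samples = []
    for t in trial_onsets:
        if test_session:
            # last 6 s: second to fourth volumes of the 8-s trial
            samples.append(clipped[t + 1:t + 4].mean(axis=0))
        else:
            # whole 8-s trial: four volumes
            samples.append(clipped[t:t + 4].mean(axis=0))
    return np.vstack(samples)
```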
Multivoxel pattern regression models were constructed to predict (regress) the feature values of a deep neural network model from patterns of brain activity in the tested brain areas.
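As an illustration only, such a decoder can be sketched as a closed-form ridge (L2-regularized) regression from voxel patterns to feature values on synthetic data; the study's actual regression model, regularization, and data dimensions are not specified here, and all shapes below are hypothetical.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form L2-regularized multivariate linear regression.

    X : (n_samples, n_voxels) brain-activity patterns
    Y : (n_samples, n_features) DNN feature values
    Returns the (n_voxels, n_features) weight matrix.
    """
    n_vox = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_vox), X.T @ Y)

rng = np.random.default_rng(0)
n_train, n_voxels, n_units = 200, 50, 30          # hypothetical sizes
W_true = rng.standard_normal((n_voxels, n_units))  # synthetic ground truth

X_train = rng.standard_normal((n_train, n_voxels))           # voxel patterns
Y_train = X_train @ W_true + 0.1 * rng.standard_normal((n_train, n_units))
X_test = rng.standard_normal((10, n_voxels))

W_hat = fit_ridge(X_train, Y_train, alpha=1.0)
Y_pred = X_test @ W_hat   # decoded DNN features, shape (10, 30)
```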
We used brain areas functionally defined for individual subjects. V1, V2, V3, and V4 were delineated following the standard retinotopy experiment. The lateral occipital complex (LOC), fusiform face area (FFA), and parahippocampal place area (PPA) were identified using conventional functional localizers. A contiguous region covering the LOC, FFA, and PPA was manually delineated on the flattened cortical surfaces, and the region was defined as the higher visual cortex (HVC). Voxels overlapping with V1-V3 were excluded from the HVC. Voxels from V1-V4 and the HVC were combined to define the visual cortex (VC).
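Schematically, the ROI combination rules above can be expressed as set operations on voxel indices (the HVC was in fact manually delineated on the cortical surface, and the index values below are toy placeholders):

```python
# Toy voxel-index sets standing in for the functionally defined ROIs
rois = {
    "V1": {1, 2}, "V2": {2, 3}, "V3": {4}, "V4": {5, 6},
    "LOC": {7, 8}, "FFA": {8, 9}, "PPA": {9, 10, 2},
}

# HVC: region covering LOC, FFA, and PPA, excluding voxels in V1-V3
hvc = (rois["LOC"] | rois["FFA"] | rois["PPA"]) \
      - (rois["V1"] | rois["V2"] | rois["V3"])

# VC: voxels from V1-V4 combined with the HVC
vc = rois["V1"] | rois["V2"] | rois["V3"] | rois["V4"] | hvc
```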