Introduction

While humans and animals can immediately recognize scenes, objects, and materials at a glance1,2,3,4, in many situations they need to sequentially sample and integrate information to make more appropriate decisions about the world5,6. In this serial vision process, attention and eye movements play a critical role in the selection of relevant information from spatially distributed image inputs5,7,8. (It should also be noted that whether covert attention is parallel or serial has been controversial for decades9,10,11,12.)

A large body of psychophysical evidence suggests that focal attention on the target’s spatial location facilitates visual processing of the target6,13,14,15,16,17,18,19,20. In most of the studies, attention has been demonstrated to increase detection performance and accelerate the reaction to the target. However, such behavioral measures alone may be insufficient in explaining how attention affects the neural process of making decisions about the target, unless the results are compared across a variety of conditions.

In visual neuroscience, reverse correlation analysis has been widely applied to reveal information that determines the system response21,22. This analysis has been applied not only to the responses of cortical neurons23,24,25 but also to human behavioral responses for a variety of visual tasks26,27,28,29,30,31,32,33. As a variant of the analysis, the classification image (CI) method allows the visualization of what information in the stimuli observers consider important for a given perceptual judgment34. In a typical experiment of the CI method, the observer’s responses to a visual target embedded in white noise are collected, and the information in the stimulus that affected the observer’s response is mapped out by analyzing the correlation between the noise and the response in each trial. The CI method has been widely used to reveal the spatiotemporal distribution of critical information (or the perceptive field) that determines the observers' judgments for various visual tasks with static and dynamic stimuli35,36,37,38,39,40,41.

Eckstein et al.36 applied the CI method to Posner’s cueing paradigm13 and showed that the weight of information in the CI is greater at the spatial location where attention was directed36. However, observers in these studies made judgments after the visual stimuli had been shown, as in many psychophysical reverse-correlation studies28,29,30,32,33. Such post-stimulus judgment, usually based on visual working memory, is not necessarily representative of the on-the-fly judgments that we make in real life. It is therefore desirable to examine the effect of dynamic attention on the CI in the period before the observer responds to the shown target.

To clarify when observers make decisions during observation and what information they rely on, we can analyze correlations at each time point of the stimulus locked to the observer's reaction time (RT) rather than to the stimulus onset. Several studies have adopted this response-locked reverse correlation analysis in investigating the dynamics of perceptual decision making27,42,43,44. Maruyama et al.45 recently applied response-locked CI analysis to a very basic visual task, namely luminance contrast detection. Using a visual display similar to that used by Neri & Heeger37, they measured responses and RTs for a target stimulus slowly emerging from dynamic noise and calculated the correlation between the noise and response at each time point, locked to the observer's RT, in reverse chronological order. Adopting this protocol, they examined which signals, at which points in the stimulus, determined the observer’s decision about the target and the RT, and they revealed spatiotemporally biphasic CIs and their RT-dependent variability. They further suggested that the results could be quantitatively reproduced by a simple computational model incorporating a perceptual process approximated by a spatiotemporal filter and a (drift–diffusion) decision process that accumulates the output.

In the present study, we extended the above experimental protocol to investigate the effects of spatial attention. To this end, in addition to noise and target stimuli similar to those used by Maruyama et al.45, we presented a cue indicating the target's location (a valid cue) or another location (an invalid cue) as used in Posner’s paradigm. Using this display, we measured responses and reaction times (RTs) for a target stimulus that appeared gradually in dynamic noise, and we compared RTs and CIs between the valid and invalid cue conditions. We found that the overall amplitude of temporally biphasic CIs was larger for the valid cue than for the invalid cue in the period during which the RT was shortened by attention. These results were well explained by a simple perceptual-decision model45, with the introduction of a single additional assumption that spatial attention produces a sustained increase in the gain of the perceptual process.

Methods

Observers

We initially recruited seven participants, which was the same number as in the previous study45. However, the data of one observer whose reaction time was found to be exceptionally long were excluded in the early phase of the experiment. The results of the others showed that one of the expected effects (an interaction between the cue condition and the peak of the amplitude) was statistically significant. Based on the effect size in this result, we conducted an a priori power analysis using G*Power 3.1, which revealed that a sample size of ten should be sufficient to achieve 0.95 power46. We then recruited four more participants, and finally ten human observers with corrected-to-normal vision, including three of the authors and seven naïve paid volunteers (having an average age of 24.2 years), participated in the experiment. All experiments were conducted in accordance with the Declaration of Helsinki. The study was approved by the Ethics Committee of The University of Tokyo. All observers gave written informed consent.

Apparatus

Visual stimuli were generated on a personal computer and displayed on liquid–crystal display monitors. To deal with the COVID-19 situation, the monitors (three BENQ XL2720B, three BENQ XL2730Z, one BENQ XL2735B, one BENQ XL2430T, one BENQ XL 2731K, one SONY PVM 2541A, and one SONY PVM-A250) were set up in the participants’ own homes. The mean luminance of uniform backgrounds ranged from 44 to 116 cd/m2. The minimum luminance of the monitors ranged from 0.01 to 2.91 cd/m2, and the maximum luminance ranged from 87 to 232 cd/m2. All monitors had gamma-corrected luminance as calibrated with a colorimeter (ColorCal II CRS). The frame rate was 60 Hz. The viewing distance was adjusted to achieve a pixel resolution of 0.018 deg/pixel.

Stimuli

The visual display followed that of the previous study45. The stimulus was a pair of square images of dynamic one-dimensional noise (4.6 × 4.6 deg), each comprising 16 vertical bars with a width of 0.29 deg (Fig. 1). The contrast (\({C}_{noise}\left(t\right)\)) of each bar switched at a frame rate of 30 Hz according to a Gaussian distribution having a standard deviation of 0.1. The total stimulus duration was 4500 ms. Two independent one-dimensional noise fields were presented adjacent to the left and right of the gaze point.

Figure 1

Schematic diagram of the visual stimuli used in the experiment. The upper panel shows a snapshot of a single stimulus frame. The bright bar on the right is the target, and the small horizontal bars above and below are the cue. The lower panel shows an X–T plot of the luminance variation of each bar. The target appears slowly in the right field. As shown in the right plot, the cue was presented for 100 ms with random timing from 500 to 2000 ms following the beginning of the increase in target contrast.

There are two target positions: left and right. The target signal (\({C}_{target}\left(t\right)\)) was added linearly to the two bars at the center of the noise area on either side. The addition of the target signal was set to start at a random frame between 0 and 500 ms after the start of the noise presentation. The luminance contrast of the target stimulus, \({C}_{target}\left(t\right)\), increased with time (t) according to

$${C}_{target}\left(t\right)={10}^{min\left(0.05\times t-3, 0\right)},$$
(1)

where t is the frame number (33 ms per unit) from stimulus onset and min is the minimum function between \(0.05\times t-3\) and 0. The contrast of each bar was clipped in the range from − 1 to + 1. The two fields, with and without the target signal, were transformed into a luminance image using the relation L(t) = Lmean (1 + C(t)), where Lmean is the mean luminance of the uniform background, which depended on the monitor of each observer (44–116 cd/m2).
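As a minimal sketch of how such a stimulus could be generated (this is not the authors' code), the following NumPy snippet builds one noise field, adds the contrast ramp of Eq. (1) to the two central bars, clips the result, and converts it to luminance. The function names, the seed, and the example onset frame are illustrative. The sketch assumes that t in Eq. (1) is counted in 33-ms frames from the start of the target ramp and that bar contrasts are stored as (frame, bar) arrays.

```python
import numpy as np

FRAME_RATE_HZ = 30   # noise update rate; one frame is ~33 ms
N_FRAMES = 135       # 4500 ms at 30 Hz
N_BARS = 16          # bars per noise field
NOISE_SD = 0.1       # SD of the Gaussian bar contrast


def target_contrast(t):
    """Eq. (1): target contrast at frame t (assumed to count from ramp onset)."""
    t = np.asarray(t, dtype=float)
    return 10.0 ** np.minimum(0.05 * t - 3.0, 0.0)


def make_field(rng, target_onset=None):
    """Return a (N_FRAMES, N_BARS) contrast array for one noise field.

    If target_onset (a frame index) is given, the ramp is added linearly
    to the two central bars from that frame onward.
    """
    contrast = rng.normal(0.0, NOISE_SD, size=(N_FRAMES, N_BARS))
    if target_onset is not None:
        ramp = target_contrast(np.arange(N_FRAMES - target_onset))
        center = slice(N_BARS // 2 - 1, N_BARS // 2 + 1)
        contrast[target_onset:, center] += ramp[:, None]
    return np.clip(contrast, -1.0, 1.0)  # clip to the range [-1, +1]


def to_luminance(contrast, l_mean):
    """L(t) = Lmean * (1 + C(t))."""
    return l_mean * (1.0 + contrast)


rng = np.random.default_rng(0)                   # illustrative seed
target_field = make_field(rng, target_onset=6)   # ramp starts at frame 6
blank_field = make_field(rng)                    # field without a target
luminance = to_luminance(target_field, l_mean=100.0)
```

Under this reading of Eq. (1), the target contrast rises from 0.001 at t = 0 to 1 at t = 60 frames (about 2 s) and stays at 1 thereafter.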

In addition to the noise and target, a cue that probabilistically indicated the location of the target was presented. The cue was a pair of bright rectangles having a height of 0.3 deg and width of 0.6 deg. They were flashed for 100 ms at 0.6 deg above and below the central two bars in either the left or right field. The cue was presented at random timings of 500–2000 ms after the beginning of the increase in target contrast. The cue was presented in 94% of the trials. In 75% of these trials, the cue was presented at the target location (i.e., the valid cue condition), and in the remaining 25% of the trials, it was presented on the opposite side (i.e., the invalid cue condition).
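The cue schedule described above can be sketched as a simple sampling routine. The snippet below is illustrative only: the function name and dictionary keys are hypothetical, and it assumes that the cue onset is drawn uniformly from the 500–2000 ms window, which the text describes only as random timing.

```python
import numpy as np

P_CUE = 0.94     # probability that a cue is shown on a trial
P_VALID = 0.75   # probability that a shown cue is at the target location


def sample_cue(rng, target_side):
    """Draw the cue condition for one trial with the target on target_side."""
    if rng.random() >= P_CUE:
        return {"condition": "no-cue", "cue_side": None, "cue_onset_ms": None}
    valid = rng.random() < P_VALID
    other_side = "left" if target_side == "right" else "right"
    return {
        "condition": "valid" if valid else "invalid",
        "cue_side": target_side if valid else other_side,
        # onset relative to the start of the target ramp (uniform: assumption)
        "cue_onset_ms": rng.uniform(500.0, 2000.0),
    }


rng = np.random.default_rng(1)  # illustrative seed
trials = [sample_cue(rng, "right") for _ in range(20000)]
fractions = {
    c: sum(t["condition"] == c for t in trials) / len(trials)
    for c in ("valid", "invalid", "no-cue")
}
```

With these probabilities, about 70.5% of all trials carry a valid cue (0.94 × 0.75), about 23.5% an invalid cue (0.94 × 0.25), and 6% no cue.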

Procedure

Observers binocularly viewed the display with a steady gaze at the fixation point and were asked to indicate by pressing a button whether the target appeared in the left or right field as rapidly as possible. Observers were informed that the cue would be presented at a valid location in the majority of trials. Auditory feedback was given for errors and responses that exceeded the deadline (4500 ms), and data in those trials were excluded from the analysis. The average error rate was 2%. The next trial started no less than 0.5 s after the observer's response. Each session of the experiment comprised 160 trials, and sessions were repeated until at least 3200 trials (4960 at maximum) were completed for each observer. In the analysis, all trials in which the observer responded before the cue, regardless of the validity of the cue, were included in the no-cue condition, along with trials in which the cue was not presented until the end.

Results

Reaction time

Figure 2a shows the reaction time (RT) as a function of the cue onset time. Each data point is the mean RT in each 100-ms epoch from the cue onset. The RT of the individual observer is defined as the harmonic mean, which is a more reliable index for measuring the central tendency of RT than the arithmetic mean47, across trials in each epoch. The red line shows the results for the valid cue, the blue line those for the invalid cue, and the black line those for no cue. RT is ~ 100 ms shorter for the valid cue than for the invalid cue, suggesting the effect of cueing. The facilitation weakens and disappears when the cue is presented at ~ 1500 ms or later, but this is simply because this range contained slow-response trials in which observers responded after the contrast of the target had become high. A two-factor ANOVA revealed significant main effects of the cue condition (F(2,18) = 48.48, p < 0.0001, η2 = 0.1011) and cue onset time (F(14,126) = 33.36, p < 0.0001, η2 = 0.0761) and a significant interaction between them (F(28,252) = 21.18, p < 0.0001, η2 = 0.0660).
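To illustrate why the harmonic mean is less sensitive to occasional slow responses, the following short sketch compares it with the arithmetic mean. The RT values are made up for illustration and are not data from the experiment.

```python
import numpy as np


def harmonic_mean_rt(rts_ms):
    """Harmonic mean of reaction times: n / sum(1 / RT_i)."""
    rts = np.asarray(rts_ms, dtype=float)
    return rts.size / np.sum(1.0 / rts)


# Hypothetical RTs (ms) within one epoch, including one slow outlier.
rts = [800.0, 900.0, 1000.0, 3000.0]
h_mean = harmonic_mean_rt(rts)   # about 1083 ms
a_mean = float(np.mean(rts))     # 1425 ms
```

In this toy example the single 3000-ms trial pulls the arithmetic mean up to 1425 ms, whereas the harmonic mean stays near 1083 ms, closer to the bulk of the responses.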

Figure 2

Effects of the spatial cue on the reaction time. (a) Reaction time as a function of the timing of the cue onset. Data are for each 100-ms epoch. The vertical axis shows the RT for the valid cue condition (red line), invalid cue condition (blue line), and no-cue condition (black line). The light-colored bands represent ± 1 S.E.M. across observers. (b) Reaction time as a function of the time from the response to cue onset. Data are for each 100-ms epoch.

To better capture the dynamic variation in RT due to spatial cueing, we also calculated the mean RT as a function of the timing of the cue onset relative to the observer's response, as shown in Fig. 2b. The plot shows that the RT was especially shortened when the valid cue appeared 500–1000 ms before the response. The RT was particularly long when the invalid cue appeared 500 ms before the response. We also find that the RT becomes longer when the cue appeared 1000 ms before the response, but this is simply because most of the responses in these trials were exceptionally slow regardless of the validity of the cue. It may appear strange that the difference in the RT between the valid and invalid cue conditions was small when the cue appeared ~ 300 ms before the response, but this difference was small simply because the cue was presented during the motor delay between the decision and button press (see also the section on computational modeling). A two-factor ANOVA revealed significant main effects of the cue condition (F(2,18) = 51.13, p < 0.0001, η2 = 0.1074) and cue onset from the response (−1500 to 0 ms) (F(14,126) = 15.05, p < 0.0001, η2 = 0.0497) and a significant interaction between them (F(28,252) = 18.41, p < 0.0001, η2 = 0.0604).

Classification image analysis

Using the same procedure as in the previous study45, we conducted reverse correlation analysis of the contrast of each bar and the observer's response (left, right) at each time point (t) back from the reaction time, and we calculated the Classification Image (CI). Figure 3a is a diagram of the analysis. As in the work of Eckstein et al.36, CIs were calculated separately for the noise field where the target was presented and that where the target was not presented.
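The response-locked analysis can be sketched as follows. The exact estimator and normalization used in the study are not spelled out in this section, so the snippet assumes the simplest variant, namely the mean noise contrast at each bar and each frame counted backward from the response, computed over correct trials. The function name and the synthetic demonstration data are illustrative.

```python
import numpy as np


def response_locked_ci(noise_trials, rt_frames, n_back):
    """Average noise aligned to the response.

    noise_trials: list of (n_frames, n_bars) noise-contrast arrays
    rt_frames:    frame index of the response on each trial
    n_back:       number of frames to look back from the response
    Returns an (n_back, n_bars) array; row k is the mean noise k frames
    before the response.
    """
    n_bars = noise_trials[0].shape[1]
    total = np.zeros((n_back, n_bars))
    count = np.zeros((n_back, 1))
    for noise, rt in zip(noise_trials, rt_frames):
        k_max = min(n_back, rt + 1)  # cannot look back past stimulus onset
        segment = noise[rt - k_max + 1 : rt + 1][::-1]  # row 0 = response frame
        total[:k_max] += segment
        count[:k_max] += 1
    return total / np.maximum(count, 1)


# Synthetic check: plant a contrast increment 10 frames before each response.
rng = np.random.default_rng(2)
noise_trials, rt_frames = [], []
for _ in range(500):
    noise = rng.normal(0.0, 0.1, size=(135, 16))
    rt = int(rng.integers(40, 120))
    noise[rt - 10, 7] += 0.5
    noise_trials.append(noise)
    rt_frames.append(rt)

ci = response_locked_ci(noise_trials, rt_frames, n_back=30)
peak = np.unravel_index(np.argmax(ci), ci.shape)  # expected at (10, 7)
```

In the synthetic check, the planted increment is recovered at 10 frames before the response in bar 7, which is what a response-locked average should show when a particular noise event systematically precedes the response.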

Figure 3

(a) Diagram of the response-locked classification image. The classification image (CI) was calculated for each bar contrast at each spatiotemporal position backward from the observer’s response time. (b) The upper images show CIs for the valid (left) and invalid (right) conditions, where the vertical axis indicates space and the horizontal axis indicates the time from the response. The lower curves show the average weights of the two bars in the center (red curves) and the average weights of the two adjacent bars (blue curves). The light-colored bands represent ± 1 S.E.M. across observers. The upper panels show the results for the target field and the lower panels show the results for the non-target field.

The images in Fig. 3b show the average CIs obtained for the target (upper panels) and non-target (lower panels) fields. For each, the left and right panels respectively show the results for valid and invalid cue conditions. The brightness of each pixel represents the weight of noise on the response at each spatial position (vertical axis) and at each time before the response (horizontal axis). Bright pixels represent positive weights and dark pixels represent negative weights. The curves below the CI show the average weights of the two central bars (red) and the two adjacent bars (blue) in the CI. We refer to these as impact curves.

The CIs and impact curves obtained for the two cue conditions have similar trends, except that the amplitude appears slightly larger under the valid cue condition. For both cue conditions, there is clearly a characteristic spatiotemporal biphasic profile immediately before the response in the target field (upper panels). At the center of the CI where the target appeared, positive weights peak at approximately 350 ms before the response and negative weights at approximately 500 ms before the response. The opposite profile is found on both sides surrounding the center. These results, having clear agreement with the results of the previous study45, suggest that the observer's decision is triggered by the luminance increase after the luminance decrease in the central bars and the relative emphasis of these luminance variations in space. Meanwhile, the CIs of the non-target field (lower panels) show no systematic variation, and we will not consider these data in the following analysis (see also the discussion section regarding the interpretation of this result).

The RT data shown in Fig. 2b indicate that the observer’s response was systematically accelerated when the valid cue was presented at a particular time before the response. With reference to these results on the RT, we next examine whether the CI varies between the valid and invalid cue conditions depending on how long before the response the cue appears. To this end, we divided the trials into epochs of 500 ms each based on the cue onset time from the response and calculated the CI for each epoch with a shift of 250 ms.

Figure 4a shows the CIs and impact curves for different times from the response to cue onset in the target field. The impact curves preserve a biphasic shape regardless of the cue onset time backward from the response. There is a negative weight of the center impact curve when the cue onset was long before the response. Given that trials in which the cue appeared long before the response are dominated by long RTs, this reflects the fact reported in the previous study that the negative weight is larger in trials of longer RTs45.

Figure 4

(a) CIs and impact curves calculated for different times from the response to cue onset in the target field. From the upper to the lower panels, the results are for trials with the cue appearing shortly to long before the reaction. The left panels show the results for the valid cue condition and the right panels the results for the invalid cue condition. (b) Peak impact at approximately 350 ms before the response plotted as a function of the time from the response to cue onset. Red and blue circles represent results for the valid and invalid cue conditions, respectively. The solid and dashed lines show the results for the center and surrounds, respectively. Error bars are ± 1 S.E.M. across 10 observers.

The overall amplitude of the impact curve is visibly different between the two cue conditions at a specific cue onset time backward from the response. Figure 4b plots the peak of the impact curve for each epoch. Here, the peak of the amplitude is defined as the average of the impacts within the range of 300–400 ms backward from the response. The red circles represent results for the valid cue condition and the blue circles those for the invalid cue condition. The solid line presents the results for the center impact and the dashed line those for the surrounds impact. We find that the positive weight is considerably larger for the valid cue condition than for the invalid cue condition when the cue appeared 500–1250 ms before the response, a range in which the RT was especially shortened by the valid cue. This can be interpreted as spatial attention, guided by the cue, boosting the impact of information at that location.
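The impact curves and the peak-impact measure can be sketched as below. The snippet assumes a CI stored as a (frames back from response, bars) array at 30 frames per second, takes the "two adjacent bars" to be the bars immediately flanking the central pair, and treats the 300–400 ms window as inclusive at both ends. These indexing choices and the function names are assumptions made for illustration.

```python
import numpy as np

FRAME_RATE_HZ = 30
CENTER_BARS = [7, 8]   # two central bars of the 16 (0-based indices)
FLANK_BARS = [6, 9]    # bars flanking the center pair (assumption)


def impact_curves(ci):
    """Return (center, surround) impact curves from a (n_back, 16) CI."""
    center = ci[:, CENTER_BARS].mean(axis=1)
    surround = ci[:, FLANK_BARS].mean(axis=1)
    return center, surround


def peak_impact(curve, window_ms=(300.0, 400.0)):
    """Mean impact within a window of time before the response."""
    t_back_ms = np.arange(curve.size) * 1000.0 / FRAME_RATE_HZ
    mask = (t_back_ms >= window_ms[0]) & (t_back_ms <= window_ms[1])
    return curve[mask].mean()


# Toy CI with known values in the 300-400 ms window (frames 9 to 12).
toy_ci = np.zeros((30, 16))
toy_ci[9:13, 7] = [1.0, 2.0, 3.0, 4.0]
toy_ci[9:13, 8] = [1.0, 2.0, 3.0, 4.0]
center_curve, surround_curve = impact_curves(toy_ci)
center_peak = peak_impact(center_curve)      # 2.5
surround_peak = peak_impact(surround_curve)  # 0.0
```

At 30 frames per second, the inclusive 300–400 ms window covers four frames (9 to 12 frames before the response), so each peak value is an average over four samples of the impact curve.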

When the cue appeared 0–500 ms before the response, the difference between the two cue conditions is small; if anything, the weights under the invalid cue condition appear slightly larger. This effect may appear to reflect suppressive effects of the cue onset itself on the response, but it is more likely a result of the inclusion of trials in which the decision was made before the cue was presented, as was the case for the RT data (Fig. 2b).

A two-factor ANOVA was performed on the peak impact for the center in Fig. 4b for the cue condition (valid, invalid) and the time from the response to the cue onset (five epochs). The results show that the main effect of the cue condition (F(1,9) = 4.40, p = 0.0654, η2 = 0.0343) was not significant, but the main effect of the time from the response to the cue onset (F(4,36) = 4.80, p = 0.0033, η2 = 0.0711) and the interaction (F(4,36) = 8.80, p < 0.0001, η2 = 0.1060) were significant. Corresponding t-tests conducted for each epoch showed significant differences for 0–500 ms (t(9) = −2.59, p = 0.0291, d = −0.6759), 500–1000 ms (t(9) = 3.37, p = 0.0082, d = 1.0702), and 750–1250 ms (t(9) = 3.04, p = 0.0141, d = 1.0031). ANOVA on the peak impact for the surrounds showed no significant effect.

Discussion

The present study applied the response-locked CI method to analyze the effect of attentional cueing on dynamic decision making in a simple contrast detection task. Consistent with the results of many previous studies13,14,15,48, we found that the RT was shortened by the valid cue, especially in the period of 300–1100 ms from cue onset. Furthermore, although the CIs had a spatiotemporally biphasic profile as in the previous study45, their overall amplitude was boosted by the valid cue, which also shortened the RT.

The reduction of the RT by the valid cue began to appear 300 ms from cue onset and was long lasting. Importantly, this response facilitation was observed even when the valid cue was presented more than 500 ms before the response. These results suggest that the valid cue summoned the observer's sustained attention to the target location and facilitated the response.

When the cue appeared 500–1250 ms before the response, the overall amplitude of the CI clearly differed between the two cue conditions, while the spatiotemporal tuning profiles were relatively constant. Under the valid cue condition, the cue directed attention to the target region and increased the weight of information, thereby enhancing contrast detection sensitivity. Under the invalid cue condition, the cue directed attention to the opposite side of the target, which relatively de-emphasized information in the target region and suppressed contrast detection. These results are consistent with a large body of psychophysical evidence that indicates an increase in the contrast sensitivity to stimuli at an attended location18,19,20,49,50,51,52 (see also the Computational model section regarding sensitization).

Computational model

On the basis of the above considerations, we examined whether the present results can be accounted for by a standard model of perceptual decision making with an additional assumption of the attentional enhancement of the perceptual sensitivity. The previous study assumed a hybrid model comprising a decision process that serially accumulates the outputs from the perceptual process, and it successfully predicted RTs and CIs in a similar display without spatial cueing45. Here, we extended the model by incorporating the effect of attention as an increase in the gain of the perceptual process, and we attempted to predict the characteristic variability of RTs and CIs depending on the cue validity.

Figure 5 presents an outline of the model. The model compares the spatially summarized outputs between two fields of the perceptual process approximated using a linear spatiotemporal filter. The decision process accumulates the differential signal of the two fields as sensory evidence over time and makes a decision when the evidence reaches a given boundary. The perceptual process is approximated as a spatiotemporal linear filter, and the decision process follows the standard drift–diffusion model for a two-alternative forced-choice task53,54,55,56,57,58. Figure 5 specifically illustrates each step in the case that the target appears on the left. In the model, the effect of spatial attention at the cued location is implemented as the amplification of the perceptual response in either the left or right field. The model assumed in the present study is very similar to traditional weighted integration models59,60. In particular, the effective sensitivity increase discussed as gain in our model can be interpreted as weighting in those models. The computation of each step is described in detail below.

Figure 5

Schematic diagram of a model based on spatiotemporal filtering and the accumulation of sensory evidence. See the main text for details.

The perceptual system is approximated as a space–time separable linear filter, \({F}_{st}\left(x,t\right)\), written as

$${F}_{st}\left(x,t\right)= {F}_{s}\left(x\right)\cdot {F}_{t}\left(\mathrm{t}\right),$$
(2)

where \({F}_{s}\left(x\right)\) is the spatial filter, \({F}_{t}\left(\mathrm{t}\right)\) is the temporal filter, t is the frame number (33 ms per unit), and x is the pixel (0.018 deg per unit). The spatial filter \({F}_{s}\left(x\right)\) is given as a difference-of-Gaussians function, which has been widely used as a first-order approximation for contrast detectors in the early visual system61:

$${F}_{s}\left(x\right)= exp\left(\frac{-{x}^{2}}{{2\sigma }_{c}^{2}}\right) -\frac{{\sigma }_{c}^{2}}{{\sigma }_{s}^{2}} \cdot exp\left(\frac{{-x}^{2}}{{2\sigma }_{s}^{2}}\right).$$
(3)

Here, \({\sigma }_{c}\) is the standard deviation of the center and \({\sigma }_{s}\) is the standard deviation of the surrounds. The temporal filter \({F}_{t}\left(\mathrm{t}\right)\) is given as a biphasic function62,63:

$${F}_{t}\left(\mathrm{t}\right)= \left(\frac{1}{n!}-B\frac{{\left(t/\tau \right)}^{2}}{\left(n+2\right)!}\right)\cdot {\left(t/\tau \right)}^{n}\mathrm{exp}\left(-t/\tau \right),$$
(4)

where \(n\) is the number of stages of the leaky temporal integrator, τ is the transient factor, and \(B\) is a parameter that defines the amplitude ratio between the positive and negative weights. Figure 6 shows the function of the temporal filter represented in Eq. (4).
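Equations (2)–(4) translate directly into code. The sketch below evaluates the spatial and temporal filters and combines them into the separable spatiotemporal filter; the function names and the (time, space) array layout are illustrative choices, and the parameter values in the example are taken from the Fig. 6 settings (τ = 2, B = 0.75, n = 5) and from arbitrary σ values rather than from any fitted observer.

```python
from math import factorial

import numpy as np


def spatial_filter(x, sigma_c, sigma_s):
    """Eq. (3): difference-of-Gaussians spatial filter, x in pixels."""
    x = np.asarray(x, dtype=float)
    center = np.exp(-x**2 / (2.0 * sigma_c**2))
    surround = (sigma_c**2 / sigma_s**2) * np.exp(-x**2 / (2.0 * sigma_s**2))
    return center - surround


def temporal_filter(t, tau, B, n=5):
    """Eq. (4): biphasic temporal filter, t in frames (t >= 0)."""
    u = np.asarray(t, dtype=float) / tau
    return (1.0 / factorial(n) - B * u**2 / factorial(n + 2)) * u**n * np.exp(-u)


def spatiotemporal_filter(x, t, sigma_c, sigma_s, tau, B, n=5):
    """Eq. (2): separable filter, returned with shape (len(t), len(x))."""
    return np.outer(temporal_filter(t, tau, B, n),
                    spatial_filter(x, sigma_c, sigma_s))


x = np.arange(-40, 41)   # pixels (illustrative extent)
t = np.arange(0, 40)     # frames (illustrative extent)
f_st = spatiotemporal_filter(x, t, sigma_c=3.0, sigma_s=16.0, tau=2.0, B=0.75)
```

One property that follows from Eq. (4) is that the temporal filter changes sign where (t/τ)² = (n + 1)(n + 2)/B. For n = 5, τ = 2, and B = 0.75 this gives t ≈ 15 frames, so the filter is positive before roughly 15 frames and negative afterward, which is the biphasic shape described in the text.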

Figure 6

Impulse response function of the temporal filter represented in Eq. (4). (a) Functions for the transient factor (\(\tau\)) ranging from 1 to 3 (\(B=0.75\), \(n=5\)). (b) Functions for the amplitude ratio between the positive and negative weights (\(B\)) ranging from 1 to 0.5 (\(\tau =2\), \(n=5\)).

The response of the perceptual system \(R\left(x,t\right)\) is then obtained by convolving the spatiotemporal filter \({F}_{st}\left(x,t-{t}_{0}\right)\) with the stimulus input \(I\left(x,t\right)\) and scaling the result by the gain \(A\) of the system, where \({t}_{0}\) is the time of stimulus onset. The gain \(A\) is assumed to be 1 by default and is amplified in the field to which attention is directed by spatial cueing:

$$R\left(x,t\right)=\mathrm{A}\cdot {F}_{st}\left(x,t-{t}_{0}\right)*I\left(x,t\right).$$
(5)

In the present study, we did not assume internal noise in the perceptual system for the practical purpose of limiting the number of model parameters; however, this omission might have resulted in ambiguities in the interpretation of the results. If the gain is implemented before the addition of location-specific internal noise, this results in an increase of the signal-to-noise ratio (sensitivity). By contrast, if internal noise is added after the integration of information, the gain becomes the weighting of evidence across locations without affecting the signal-to-noise ratio of individual locations. The difference between these two functions should be investigated by further experiment and modeling that focus on the question.

Determination of whether the target is presented in the left or right field is made by comparing the spatial sum of the absolute values of the responses in each field. The model is assumed to continuously monitor the difference ΔR(t) between the left and right responses at time t from the stimulus onset. Here, ΔR(t) is considered the sensory evidence at time t in the decision process:

$$\Delta {\text{R}}\left(t\right)={\sum }_{x}\left|{\text{R}}_{{\text{l}}{\text{eft}}}\left(x,t\right)\right|-{\sum }_{x}\left|{\text{R}}_{\text{right}}\left(x,t\right)\right|.$$
(6)
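Equations (5) and (6) can be sketched with plain NumPy as follows. Because the filter is space–time separable, the convolution is carried out as a spatial convolution followed by a causal temporal convolution. The zero-padded boundaries, the function names, and the small toy kernels are assumptions for illustration and do not correspond to fitted model parameters.

```python
import numpy as np


def perceptual_response(stimulus, f_s, f_t, gain=1.0):
    """Eq. (5): R = A * (F_st convolved with I) for a separable filter.

    stimulus: (n_frames, n_x) contrast array for one field
    f_s:      1-D spatial kernel (odd length, shorter than n_x)
    f_t:      1-D causal temporal kernel (index 0 = current frame)
    """
    n_frames = stimulus.shape[0]
    spatial = np.apply_along_axis(
        lambda row: np.convolve(row, f_s, mode="same"), 1, stimulus)
    causal = np.apply_along_axis(
        lambda col: np.convolve(col, f_t)[:n_frames], 0, spatial)
    return gain * causal


def sensory_evidence(r_left, r_right):
    """Eq. (6): difference of spatially summed absolute responses."""
    return np.abs(r_left).sum(axis=1) - np.abs(r_right).sum(axis=1)


# Toy example: a single contrast impulse in the left field, attended (A = 1.2).
f_s = np.array([-0.25, 1.0, -0.25])
f_t = np.array([0.0, 1.0, -0.5])
left = np.zeros((10, 9))
left[2, 4] = 1.0
right = np.zeros((10, 9))
r_left = perceptual_response(left, f_s, f_t, gain=1.2)
r_right = perceptual_response(right, f_s, f_t)
delta_r = sensory_evidence(r_left, r_right)
```

In the toy example, the impulse at frame 2 produces evidence for "left" of 1.8 at frame 3 and 0.9 at frame 4. Because ΔR(t) sums absolute values, both the positive and the negative lobes of the filter response contribute evidence with the same sign.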

Decisions about targets are based on evidence accumulated over time. However, many decision-making studies suggest that sensory evidence decays with time29,33,64. This process is approximated by leaky temporal integration, which is mechanistically ascribed to adaptive gain control65,66. The model thus assumes that the cumulative evidence \(S(T)\) at time T is obtained from an approximation of the noisy leaky integral of ΔR(t) as

$$S(T)= \sum_{t=1}^{T}({\upgamma }^{\left(T-t\right)}\Delta {\text{R}}\left(t\right)+ {\epsilon }_{t}),$$
(7)

where \(\gamma\) is the time constant of evidence integration and \({\epsilon }_{t}\) is the internal noise having a normal distribution. The model observer makes a decision about whether the target is on the left or right when \(S(T)\) reaches a certain decision boundary; i.e., \(b\) or \(-b\), respectively. Finally, the model observer is assumed to execute a manual response after a constant motor delay of 250 ms from T.
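The decision stage of Eq. (7) can be sketched as a running update. The snippet follows the equation as written: the evidence term leaks with factor γ per frame, while the noise terms are summed without leak. The function name, the frame duration used to convert T into milliseconds, and the toy parameter values are assumptions for illustration.

```python
import numpy as np

FRAME_RATE_HZ = 30
MOTOR_DELAY_MS = 250.0


def simulate_decision(delta_r, gamma, bound, noise_sd, rng):
    """Accumulate evidence per Eq. (7) until it reaches +bound or -bound.

    Returns (choice, rt_ms); choice is "left", "right", or None if the
    boundary is never reached within the stimulus.
    """
    leaky = 0.0   # sum over t of gamma**(T - t) * delta_r[t]
    noise = 0.0   # sum over t of eps_t
    for T, evidence in enumerate(delta_r, start=1):
        leaky = gamma * leaky + evidence
        noise += rng.normal(0.0, noise_sd)
        s = leaky + noise
        if abs(s) >= bound:
            rt_ms = T * 1000.0 / FRAME_RATE_HZ + MOTOR_DELAY_MS
            return ("left" if s > 0 else "right"), rt_ms
    return None, None


rng = np.random.default_rng(3)
# Noise-free toy run: constant evidence of 1 per frame favoring "left".
choice, rt_ms = simulate_decision(np.ones(20), gamma=0.5, bound=1.9,
                                  noise_sd=0.0, rng=rng)
```

In the noise-free toy run, the leaky sum takes the values 1, 1.5, 1.75, 1.875, and 1.9375, so the boundary of 1.9 is reached at T = 5 frames and the simulated RT is 5 frames plus the 250-ms motor delay. With γ < 1 and constant input the sum saturates at 1/(1 − γ) times the input, which is why weak evidence alone may never reach the boundary without the help of noise.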

Our psychophysical data suggest a reduction in RT and an increase in the CI amplitude for the target at the cued location. According to the model architecture described above, this attentional facilitation must occur prior to the comparison of the left and right fields. We therefore chose to incorporate the effect of attention as a sustained increase in the gain of the perceptual system following the cue onset. Specifically, we assumed that the increase in gain due to attention begins with a fixed delay of 50 ms following the cue onset.
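The assumed time course of the attentional gain can be written as a step function. In the sketch below, the gain of the cued field switches from 1 to a constant value 50 ms after cue onset and remains there until the end of the trial. The function name and the example gain value are illustrative, and the evaluation on a 30-Hz frame grid is an assumption.

```python
import numpy as np

FRAME_RATE_HZ = 30


def gain_time_course(n_frames, cue_onset_ms, cued_gain, delay_ms=50.0):
    """Per-frame gain A(t) for the cued field: 1 before, cued_gain after."""
    t_ms = np.arange(n_frames) * 1000.0 / FRAME_RATE_HZ
    return np.where(t_ms >= cue_onset_ms + delay_ms, cued_gain, 1.0)


# Example: cue at 1000 ms, gain of 1.2 in the cued field (illustrative value).
gain = gain_time_course(n_frames=135, cue_onset_ms=1000.0, cued_gain=1.2)
```

A per-frame gain of this kind would be applied to the response of the cued field only, for example as `r_cued * gain[:, None]` for a (frame, position) response array, while the uncued field keeps a gain of 1.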

In our model simulations, we attempted to reproduce the impact curve (Fig. 4) for the data of each observer, using the stimuli actually shown to the observer. The parameters of the model were estimated to minimize the RMS error with the impact curve, separately for each of the valid and invalid cue conditions. Under the invalid cue condition, the gain (\(A\)) was fixed at 1. In accordance with previous studies45, the number of stages of the temporal filter (\(n\)) was fixed at five.

Figure 7 shows the results of the model simulation. The model reproduces many aspects of the human data. Figure 7a shows a continuous reduction in the RT under the valid cue condition. Figures 7b and c present larger amplitudes of the CI under the valid cue condition than under the invalid cue condition, except for the period immediately after the cue onset.

Figure 7

Reaction times and CIs predicted by the model. Each panel corresponds to the results for human observers in Figs. 2 and 4. (a) Variation in the RT with cue. RTs for the no cue condition were estimated by averaging the valid and invalid conditions. (b) CIs and impact curves for each 500-ms epoch of the cue onset from the response. (c) Peak impact near 350 ms before the response.

Estimated parameters and the S.E. across model observers were [σc, σs, B, τ, γ, \(b\), \({\epsilon }_{t}\), \(A\)] = [2.94, 16.9, 0.505, 2.38, 0.207, 356.0, 10.6, 1.20 (S.E. = 0.13, 0.93, 0.006, 0.11, 0.004, 7.09, 0.69, 0.025)] for the valid cue condition and [σc, σs, B, τ, γ, \(b\), \({\epsilon }_{t}\)] = [2.92, 15.7, 0.490, 2.41, 0.212, 353.5, 9.80 (S.E. = 0.13, 0.59, 0.006, 0.10, 0.007, 7.22, 1.40)] for the invalid cue condition. The relative gain \(A\) of the spatiotemporal filter (1.20) is significantly greater than 1 (t(9) = 7.69, p < 0.0001, d = 3.438) for the valid cue condition, providing evidence of sensory amplification by attention. Meanwhile, the other parameters did not significantly differ between the valid and invalid cue conditions, except for the ratio of positive to negative phases (\(B\)) in the temporal filter, which was significantly larger under the valid cue condition (t(9) = 2.47, p = 0.0355, d = 0.8368), although the difference was very small (0.505 vs. 0.490).

Taken together with the results of the previous study45, the present results suggest that perceptual decision making in simple contrast detection can be well described by a simple perceptual decision model. Adding to this basic result, the present data suggest that the facilitation of behavioral decision making via spatial attention is explained by the increased gain of the perceptual process. The effect of attention is not a modulation of the shape of the perceptual filter by the presence or absence of a cue but simply a change of gain (weighting), which is consistent with the results of previous studies36,60.

It is noteworthy that the difference in RT produced by the spatial cue was ~100 ms or more. This difference is considerably larger than the values reported in many attention studies13,14,48. If we convert this 100-ms difference to target bar contrast, we can infer that the valid cue enabled observers to detect targets with nearly half the contrast (~0.3 log units lower) of targets detectable with the invalid cue. One possible cause of this sensitization is the suppression of external noise by attention, which has been demonstrated by a larger attentional enhancement of contrast detection for a target flashed with noise than for one flashed without noise50. However, it has also been indicated that such attentional facilitation in noise is found only when the target is flashed together with noise50. The other possibility is an increase in the absolute sensitivity for targets with gradually increasing contrast over time, such as those used in the present study. Psychophysical evidence indicates that attention increases contrast sensitivity by 0.05–0.1 log units for targets of a stepwise waveform18,19, but by as much as 0.5–1.0 log units for targets with a gradual onset, regardless of the presence of external noise52. This large sensitization to the gradual target is observed even in the absence of background noise, which indicates an increase in the absolute detection sensitivity (i.e., reduction of internal noise) rather than a reduction of external noise52.
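
The log-unit figures in this paragraph can be turned into contrast ratios with a short calculation. The sketch assumes base-10 log units, the usual convention for contrast sensitivity. The helper name `log_units_to_ratio` is hypothetical.

```python
def log_units_to_ratio(log_units):
    """Convert a sensitivity change in base-10 log units to a multiplicative factor."""
    return 10.0 ** log_units

# ~0.3 log units: the cueing effect inferred from the 100-ms RT difference
cue_ratio = log_units_to_ratio(0.3)
# Ranges quoted for stepwise and gradual-onset targets
step_range = (log_units_to_ratio(0.05), log_units_to_ratio(0.1))
gradual_range = (log_units_to_ratio(0.5), log_units_to_ratio(1.0))

print(f"0.3 log units -> x{cue_ratio:.2f}")
print(f"stepwise: x{step_range[0]:.2f} to x{step_range[1]:.2f}")
print(f"gradual:  x{gradual_range[0]:.2f} to x{gradual_range[1]:.2f}")
```

Under this convention, 0.3 log units corresponds to a factor of about 2, which matches the reading of "nearly half the contrast". A change of 0.05–0.1 log units is roughly a 12–26% change, whereas 0.5–1.0 log units is roughly a 3- to 10-fold change.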

However, the present finding does not rule out a variety of other possible explanations. Even assuming the model architecture shown in Fig. 5, it is difficult to determine exactly at what stage the attentional modulation of gain occurs. Our model assumes that the sensitivity of the spatiotemporal filter is modulated, but it is also possible that the spatially summarized output of the perceptual processes is modulated. Additionally, in the present study, we assumed that left and right evidence was integrated linearly; however, Shimozaki et al.60 demonstrated that the sum of weighted likelihoods could more adequately approximate human observers. The magnitude of the gain (weighting) effect at each location could therefore be modeled more elaborately. Furthermore, if we assume another type of decision model such as a ballistic model67,68, which can deal with two sensory evidence signals independently for each of the two target fields, it cannot be determined whether attention affects the perceptual process for each target or the decision process for each evidence signal. As pointed out in previous studies45, it is generally difficult to distinguish between the contributions of perceptual and decision processes in psychophysical reverse-correlation data.

There are some limitations in the experimental design of the present study, which should be considered when interpreting the results. No result was obtained for CIs in the non-target field (Fig. 3b). This is likely to be caused by the design of the present experiment, in which observers could wait for the target until they were certain to respond to it. Indeed, the average proportion of correct responses was 98%. There are two main problems with CI analysis that relies only on the target field. First, the target-present CIs are known to be biased in their estimation of the underlying weights the observer applies to the image69,70. Many different sets of weights of stimulus features (pixels) can result in a CI that looks like the target. Therefore, it is possible that the increase in impact (center) is partly attributable to the increase in luminance of the target. Second, as indicated in the literature70,71,72, uncertainty in the experiment could cause the difference in CIs between target and non-target fields. In the present study, it was essential that the target was gradual and presented continuously until it was detected by the observer. However, this inevitably resulted in an experimental paradigm with a large amount of temporal uncertainty.

Additionally, we could not distinguish between eye movements and covert attention shifts because eye tracking was not available. It has been suggested that the effects of visual search and cue can be explained by the same model36, and that the effect predicted by the model fits the observations better with eye movements73. Although our observers were strongly instructed to maintain rigid fixation and the cue was flashed for a short time (100 ms), it was still possible for them to make eye movements during the stimulus presentation. Therefore, caution is warranted in ascribing the present results solely to the effect of covert attention.

It is of interest to know how long the cueing effect persists. Murai and Whitney (2021) recently applied the CI method to a simple orientation detection task and demonstrated a serial dependence effect over several seconds74. Because cue onset rarely occurred more than 1500 ms before the response in our experiment, it was difficult to systematically analyze the duration of the cueing effect. However, the CIs indicated that the cueing effect persisted even when the cue appeared 1000–1500 ms before the response (Figs. 4 and 7), which suggests that attentional facilitation could last longer than approximately 1 s. It would be worth exploring the possibility of long-lasting cueing effects in a future investigation using CI analysis.