Response-locked classification image analysis of perceptual decision making in contrast detection

In many situations, humans make decisions based on serially sampled information through the observation of visual stimuli. To quantify the critical information used by the observer in such dynamic decision making, we here applied a classification image (CI) analysis locked to the observer's reaction time (RT) in a simple detection task for a luminance target that gradually appeared in dynamic noise. We found that the response-locked CI shows a spatiotemporally biphasic weighting profile that peaked about 300 ms before the response, but this profile substantially varied depending on RT; positive weights dominated at short RTs and negative weights at long RTs. We show that these diverse results are explained by a simple perceptual decision mechanism that accumulates the output of the perceptual process as modelled by a spatiotemporal contrast detector. We discuss possible applications and the limitations of the response-locked CI analysis.

In the present study, we applied the response-locked CI analysis to the most basic visual task, luminance contrast detection. Specifically, we used stimuli similar to those used by Neri and Heeger 18 to measure responses and reaction times for target stimuli that emerge slowly in dynamic noise, and we then analyzed the correlation between the noise and the response at each time point backward, locked to the observer's reaction time. This protocol allowed us to examine what signals, and at what point in the stimulus, determined the observer's decision about the target and the observer's reaction time. The results revealed spatiotemporally biphasic CIs similar to those reported by Neri and Heeger 18. On the other hand, we also found that the profile of the CI substantially varied depending on the response time of the observer in a way that was unpredictable from the response properties of the early visual system. These apparently complicated results, however, were quantitatively described by a simple computational model incorporating a perceptual process approximated by a spatiotemporal filter and a decision process (drift-diffusion) that accumulates its output 32,43.

Methods
Observers. Five naïve observers and two of the authors (average age: 22.8 years) with corrected-to-normal vision participated in the experiment. All experiments were conducted with permission from the Ethics Committee of the University of Tokyo. Observers gave written informed consent. The study followed the Declaration of Helsinki guidelines.
Apparatus. Visual stimuli were displayed on a gamma-corrected LCD monitor (BENQ XL2735) controlled by a PC. The refresh rate was 60 Hz, and the pixel resolution was 0.04 deg/pixel at the viewing distance of 50 cm. The mean luminance of the uniform background was 88.9 cd/m². All experiments were conducted in a dark room.
Stimuli. The visual stimulus was square dynamic one-dimensional noise (4.8 × 4.8 deg) comprising 16 vertical bars with a width of 0.3 deg (Fig. 1). The contrast C_noise(t) of each bar was switched at a frame rate of 30 Hz according to Gaussian noise with an RMS contrast of 0.1. The total duration was 8000 ms. Two independent 1D-noise fields were presented adjacent to the fixation point.
Here, t is the frame number (33 ms per unit) from stimulus onset. α is the rate of increase, which was set at three levels: 0.05, 0.1, and 0.2. The contrast of each bar was clipped to the range of −1 to +1. The two fields, with and without the target signal, were transformed into luminance images using the relation L(t) = L_mean (1 + C(t)), where L_mean is the mean luminance of the uniform background (88.9 cd/m²).
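The noise generation and the contrast-to-luminance relation described above can be sketched as follows. This is a minimal illustration, not the authors' stimulus code: the function names and the array layout (bars × frames) are assumptions, the frame count of 240 is derived from 8000 ms at 30 Hz, and the target signal is deliberately omitted.

```python
import numpy as np

N_BARS = 16          # vertical bars per noise field (from the text)
N_FRAMES = 240       # 8000 ms at 30 Hz (derived from the text)
RMS_CONTRAST = 0.1   # RMS contrast of the Gaussian noise (from the text)
L_MEAN = 88.9        # mean luminance in cd/m^2 (from the text)

def make_noise_field(rng):
    """Gaussian contrast noise, one value per bar and frame (assumed layout)."""
    return rng.normal(0.0, RMS_CONTRAST, size=(N_BARS, N_FRAMES))

def contrast_to_luminance(contrast):
    """Clip contrast to [-1, +1], then apply L = L_mean * (1 + C)."""
    clipped = np.clip(contrast, -1.0, 1.0)
    return L_MEAN * (1.0 + clipped)

rng = np.random.default_rng(0)
noise = make_noise_field(rng)
luminance = contrast_to_luminance(noise)
```

Clipping before the transform guarantees that the luminance stays between 0 and twice the mean luminance.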
Procedure. In each trial, observers viewed the stimulus binocularly while fixating the fixation point and indicated, by pressing a button as quickly as possible, whether the target appeared in the left or right noise field. If an observer failed to respond before the deadline (8000 ms) or responded incorrectly, auditory feedback was given, and the data recorded in that trial were excluded from the analyses. The next trial started no less than 0.5 s after the observer's response. The average error rates were 0.03, 0.02, and 0.02 for contrast increases (α) of 0.05, 0.1, and 0.2, respectively.
Ethics approval. All experimental protocols were approved by the Ethics Committee of the University of Tokyo.
Approval for human experiments. Written informed consent was obtained from all participants/ observers.

Results
Reaction time. Figure 2b shows that a slower rate of increase in the contrast resulted in a longer average reaction time. A one-way repeated-measures ANOVA showed a significant effect of the rate of contrast increase on the average reaction time (F(2, 12) = 1574.6, p < 0.001).
Reverse correlation analysis locked to the response time. We conducted a reverse correlation analysis between the contrast of each bar and the observer's response (left, right) at each time (t) back from the reaction time to characterize the noise common to the time before the reaction. Figure 3 is a diagram of the analysis. As in Neri and Heeger 18, μ1(x,t) is the mean of the noise contrast in the region where the observer responded that the target was present and μ0(x,t) is the mean of the noise contrast in the region where the observer responded that the target was not present. Since the reaction time varied across trials, the number of trials used to calculate the mean at each time point was not constant and tended to decrease as time increased backward from the response. The results were calculated as follows.
Here, Mean Kernel refers to the effect of the noise contrast on the response. The upper panels in Fig. 4 show the classification image (i.e., Mean Kernel) obtained in the reverse correlation analysis locked to the reaction time. The horizontal axis represents the time (t) back from the reaction time, and the vertical axis represents the spatial position of each bar (x). Relative to the grey background, brighter points represent positive weights and darker points represent negative weights. Individual panels show results for a contrast increase (α) of 0.05, 0.1, or 0.2. The lower panels show the mean of the weights of the two central bars (red) and the mean of the weights of the two adjacent bars (blue) in the CI. The vertical axis represents the weight and the horizontal axis represents the time from the reaction time. We refer to these plots as impact curves.
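A response-locked mean kernel of this kind can be sketched as below. The sketch assumes that the kernel is the difference μ1(x,t) − μ0(x,t) between the chosen and the unchosen field, which is an inference from the definitions above rather than a formula quoted from the paper. The function name, argument layout, and the choice to index lag 0 at the last frame before the response are all hypothetical.

```python
import numpy as np

def response_locked_kernel(chosen, unchosen, rt_frames, n_lags):
    """
    chosen, unchosen: lists of (n_bars, n_frames) noise-contrast arrays for the
        field the observer selected and the field not selected, one pair per trial.
    rt_frames: reaction time of each trial, in frames from stimulus onset
        (assumed not to exceed n_frames).
    n_lags: number of frames to look back from the response.
    Returns (kernel, counts): kernel[x, k] is the assumed mu1 - mu0 at k frames
        before the response; counts[k] is the number of trials available at lag k.
    """
    n_bars = chosen[0].shape[0]
    sum1 = np.zeros((n_bars, n_lags))
    sum0 = np.zeros((n_bars, n_lags))
    counts = np.zeros(n_lags, dtype=int)
    for c, u, rt in zip(chosen, unchosen, rt_frames):
        k_max = min(n_lags, rt)          # cannot look back past stimulus onset
        idx = rt - 1 - np.arange(k_max)  # frame indices, most recent first
        sum1[:, :k_max] += c[:, idx]
        sum0[:, :k_max] += u[:, idx]
        counts[:k_max] += 1
    valid = counts > 0
    kernel = np.full((n_bars, n_lags), np.nan)
    kernel[:, valid] = (sum1[:, valid] - sum0[:, valid]) / counts[valid]
    return kernel, counts
```

Returning the per-lag trial counts makes explicit that fewer trials contribute at longer lags, as noted in the text.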
The above plots show characteristic temporal and spatial variations in the weights before the target detection response. At the center of the stimulus, where the target appeared, a large positive weight was found about 300 ms before the response and a negative weight about 500 ms before the response. Spatially, the weights around the target are reversed in sign relative to the center. This means that a decrease in luminance in the bars adjacent to the target is also useful for detecting the target. The absolute magnitude of the weights also tends to decrease as the contrast increase rate increases. We also conducted the same analysis for contrast variance, as was done in previous studies 18,44, but found no clear CI profile. This discrepancy could arise because the target appeared gradually in the present study, whereas it was abruptly flashed in the previous studies.
As shown in Fig. 2, the reaction time of the observer varied even under the same conditions. Each individual observer responded quickly in some trials and took a long time in others. Taking advantage of this fact, we investigated whether and how the CI changes with the reaction time. To this end, for each condition of the contrast increase rate, we divided each observer's data into three groups of about one third of the trials each (short, intermediate, and long reaction times) and carried out the reverse correlation analysis for each group. Figure 5 shows the CIs and impact curves obtained for each reaction time group. The results were surprisingly different across the groups. The positive weights about 300 ms before the reaction are larger for the shorter reaction time group. Conversely, the negative weights about 500 ms before the reaction are larger for the longer reaction time group. This tendency is consistent regardless of the contrast increase rate (α).
The results indicate that the spatiotemporal profile of the weights of information correlated with the response is remarkably different depending on the reaction time.
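The grouping of trials by reaction time can be sketched as follows, under the assumption that the three groups are disjoint and of near-equal size. The function name is hypothetical, and the exact split used in the study may differ.

```python
import numpy as np

def split_by_rt(rts):
    """Return trial-index arrays for the short, intermediate, and long RT groups."""
    rts = np.asarray(rts, dtype=float)
    order = np.argsort(rts, kind="stable")   # trial indices sorted by RT
    return np.array_split(order, 3)          # three groups of near-equal size

short, middle, long_ = split_by_rt([5, 1, 3, 2, 6, 4])
```

The reverse correlation analysis can then be run separately on the trials indexed by each group.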

Discussion
The present study examined the information utilization strategy adopted in dynamic decision making during stimulus observation in a simple contrast detection task. Applying the classification image method, we calculated the weights of the embedded noise at each time point retrospectively from the reaction time for the target. The resulting CIs indicate that observers responded by utilizing the biphasic luminance change and the central antagonistic spatial contrast before the response. In addition, we found that these spatiotemporal profiles of CIs varied significantly depending on the reaction time.
The complex diversity of the results depending on the reaction time appears difficult to understand intuitively. It may seem to indicate that the observers were so flexible that they used different strategies for utilizing information depending on whether they could respond quickly to the target or not. However, we should first consider a simple explanation: that the results are a natural consequence of an interplay between the sensory system and the decision process. Therefore, we tested a simple model consisting of the early visual process (linear filtering model) and the perceptual decision process (drift-diffusion model). We found that this conservative model with a fixed set of parameters successfully duplicated the human data for all conditions and RT ranges.
Computational model. Figure 6 shows an outline of the model, which is inspired by a previous study on spatiotemporal ensemble perception (Yashiro et al. 28). The model compares the spatially summarized outputs of the perceptual process, which is approximated using linear spatiotemporal filters, between the two regions. The decision process accumulates the differential signal between the two regions as sensory evidence over time and makes a decision when the evidence reaches a given boundary. The basic structure of the perceptual process follows that of a previous CI study (Neri and Heeger 18), and the computation of the decision making follows the traditional drift-diffusion model (DDM) for a two-alternative forced-choice task 32-35. Figure 6 shows each step of the process graphically for the case in which a target appears on the left. The calculation of each step is described in detail below.
Following previous studies 18, the perceptual system is approximated as a space-time separable linear filter, F_st(x,t), that is, as the product of a spatial and a temporal component: F_st(x,t) = F_s(x) F_t(t).
Here F_s(x) is the spatial filter and F_t(t) is the temporal filter. The spatial filter F_s(x) is given as a DoG function, which has been widely used as a first-order approximation for contrast detectors in the visual system 36. Here σ_c is the standard deviation for the central region and σ_s is the standard deviation for the adjacent region. The temporal filter F_t(t) is given as the following biphasic function 37,38.
Here n is the number of stages in the time integrator, τ is the transient factor, and B is a parameter that defines the amplitude ratio of the positive and negative phases.
The response of the perceptual system was obtained by convolving the above spatiotemporal filter F st (x,t) with the stimulus input I(x,t).
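The perceptual stage can be sketched as below. The functional forms are assumptions: a difference of two unit-area Gaussians for the DoG, and a commonly used biphasic form, (t/τ)^n exp(−t/τ) [1/n! − B (t/τ)²/(n+2)!], for the temporal filter. The paper's exact equations may differ. All function names and the parameter values in the usage lines are illustrative only, and the spatial unit here is bars rather than pixels.

```python
import numpy as np
from math import factorial

def gaussian(x, sigma):
    """Unit-area Gaussian."""
    return np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def dog_filter(x, sigma_c, sigma_s):
    """Assumed DoG form: narrow center Gaussian minus broad surround Gaussian."""
    return gaussian(x, sigma_c) - gaussian(x, sigma_s)

def biphasic_filter(t, n, tau, B):
    """Assumed biphasic form: s^n exp(-s) [1/n! - B s^2/(n+2)!], with s = t/tau."""
    s = np.asarray(t, dtype=float) / tau
    return s**n * np.exp(-s) * (1.0 / factorial(n) - B * s**2 / factorial(n + 2))

def perceptual_response(stimulus, f_s, f_t):
    """Separable filtering of a (bars, frames) array: space first, then causal time."""
    spatial = np.apply_along_axis(
        lambda col: np.convolve(col, f_s, mode="same"), 0, stimulus)
    n_frames = stimulus.shape[1]
    return np.apply_along_axis(
        lambda row: np.convolve(row, f_t)[:n_frames], 1, spatial)

# Illustrative values only; f_s must not be longer than the number of bars.
f_s = dog_filter(np.arange(-4, 5), sigma_c=1.0, sigma_s=2.0)
f_t = biphasic_filter(np.arange(30), n=5, tau=1.5, B=1.0)
stimulus = np.zeros((16, 40))
stimulus[8, 0] = 1.0                      # unit impulse at bar 8, frame 0
response = perceptual_response(stimulus, f_s, f_t)
```

Because the filter is separable, the impulse response at the stimulated bar is simply the temporal filter scaled by the center value of the spatial filter.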
Decisions concerning whether the target was presented in the left or right region were made by comparing the spatial sum of the absolute values of the responses in each region between the left and right. Thus, the model observer continually monitored the difference ΔR(t) between the left and right responses at time t from the stimulus onset. Here, ΔR(t) is regarded as the sensory evidence at time t in the decision-making model.
Decisions for targets are based on evidence accumulated over time. However, a number of decision-making studies suggest that sensory evidence decays with time; that is, the evidence weakens as it ages 22,39,40. This property is practically described as a leaky temporal integration, and it is potentially a product of the adaptive gain control of evidence signals 41,42. According to these findings, the present modeling assumes that the cumulative evidence S(T) at time T is given by the following equation, which approximates the noisy leaky integration of ΔR(t).
Here γ is the time constant of evidence integration and ε_t is the internal noise following a normal distribution. The model observer decides that the target is on the left or right when S(T) exceeds the decision boundary b or falls below −b, respectively. The observer was assumed to execute a manual response after a constant motor delay of 250 ms from T.
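The decision stage can be sketched as a noisy leaky accumulator. The recursive update below, S(T) = exp(−1/γ) S(T−1) + ΔR(T) + ε_T, is one common discrete form of leaky integration and is an assumption rather than the paper's exact equation. The function name, the use of the frame index as T, and the treatment of trials in which no boundary is reached are hypothetical.

```python
import numpy as np

MOTOR_DELAY_MS = 250      # constant motor delay (from the text)
FRAME_MS = 1000 / 30      # duration of one 30 Hz frame (from the text)

def decide(delta_r, gamma, bound, noise_sd, rng):
    """
    Leaky, noisy accumulation of the evidence delta_r[t] (left minus right).
    gamma is the decay time constant in frames, bound is the boundary b, and
    noise_sd is the standard deviation of the internal noise.
    Returns (choice, rt_ms); choice is 'left', 'right', or None if no boundary
    is reached before the evidence runs out.
    """
    decay = np.exp(-1.0 / gamma)   # fraction of evidence retained per frame
    s = 0.0
    for frame, evidence in enumerate(delta_r):
        s = decay * s + evidence + rng.normal(0.0, noise_sd)
        if s >= bound:
            return "left", frame * FRAME_MS + MOTOR_DELAY_MS
        if s <= -bound:
            return "right", frame * FRAME_MS + MOTOR_DELAY_MS
    return None, None

rng = np.random.default_rng(0)
choice, rt_ms = decide(np.ones(10), gamma=1e9, bound=2.5, noise_sd=0.0, rng=rng)
```

With a short time constant the accumulator saturates: for constant evidence of 1 per frame and γ = 1, S(T) approaches 1/(1 − e⁻¹) ≈ 1.58 and never reaches a boundary of 2.5, which illustrates how leakage limits the influence of early evidence.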
In this modeling, the perceptual process has five parameters: the standard deviations of the spatial filter (σ_c and σ_s, in pixels), the number of integration stages of the biphasic temporal filter (n), the time constant (τ, in frames), and the ratio of the positive to negative phases (B). The decision-making process has three parameters: the decision boundary (b), the internal noise (ε_t), and the time constant of evidence decay (γ, in frames).
Model simulation. We analyzed the CI and impact curves of the model observer using the image input data that were presented to each observer in the experiment. In the simulations, for all data in the condition of α = 0.05, the model parameters were optimized for each observer to minimize the squared error between the impact curve obtained for the model observer and that of the human observer. To stabilize the fitting, only the number of integration stages of the biphasic temporal filter (n) was fixed, at 5, for all model observers. Figure 7 shows the simulation results. The thick impact curve represents the average of the results obtained for the optimized model for each observer, and the light-colored bands represent the ± 1 s.e. range of the average for the human observer data. Estimated parameters and the s.e. across model observers were [σ c , σ s , B, τ, γ, 1.67, 0.07, 0.158, 0.002, 3.58, 2.64). For all values of the contrast increase (α), we find that the model successfully duplicated both the CI and the impact curve of the observers. For the three different reaction time groups (Fig. 7b-d), the model duplicated the observers' data, reflecting the characteristic differences of the RT-dependent CI and impact curves. The root-mean-square error (i.e., difference) between the fitted models and the human observers' data, averaged over all observers, was 0.005 (s.e. = 0.0001). The difference in the behavior of the impact curves for each reaction time group can be intuitively explained as follows.
For short reaction times, the positive weights about 300 ms before the response are larger because there is not enough time for the negative part of the biphasic temporal impulse response to act. For long reaction times, on the other hand, the negative part of the biphasic temporal impulse response has time to act and becomes visible, while the positive weight is thought to shrink because its effect decays via leaky integration.
To investigate the importance of the functional processes assumed in the model in Fig. 6, we simulated the model without some of the functional processes. We found that (1) the unique shape of the observers' CIs and impact curves could not be simulated if even one of the parameters of the spatiotemporal filter was omitted and (2) without the leaky integration property being assumed in the decision making, the effect of the early stage of stimulus presentation did not decrease even after a long observation in some RT ranges. On the other hand, modifying the model to accumulate the responses in each domain separately as two pieces of evidence and then calculate those differences, instead of accumulating the differences in responses between the two domains as evidence, did not change the behavior of the model, because the model essentially accumulates evidence linearly 43 .
The present results support the idea that on-the-view behavioral responses to visual stimuli can be explained by a simple combination of the conventional perceptual model and the standard perceptual decision-making model. This finding may allow us to perform a response-locked reverse correlation analysis of human responses to sensory stimuli during observation, rather than after observation, to explore the characteristics and strategies of human information use in various cognitive tasks. In further investigations, a similar framework may be used to understand the mechanisms for attentional selection and for high-level visual cognition. The present computational model can be used as a baseline account in these investigations.
It should be noted that psychophysical analysis cannot reliably separate the properties of decision making from the low-level perceptual process 31 . Although one can partially overcome this limitation by making full use of various aspects of data, such as by dividing the data into different RT ranges as in the present study, it is difficult to distinguish between some properties such as the latency of the perceptual sensors and the motor delay in the decision process.