To perform visual search, humans, like many mammals, encode a large field of view with retinas having variable spatial resolution, and then use high-speed eye movements to direct the highest-resolution region, the fovea, towards potential target locations1,2. Good search performance is essential for survival, and hence mammals may have evolved efficient strategies for selecting fixation locations. Here we address two questions: what are the optimal eye movement strategies for a foveated visual system faced with the problem of finding a target in a cluttered environment, and do humans employ optimal eye movement strategies during a search? We derive the ideal bayesian observer3,4,5,6 for search tasks in which a target is embedded at an unknown location within a random background that has the spectral characteristics of natural scenes7. Our ideal searcher uses precise knowledge about the statistics of the scenes in which the target is embedded, and about its own visual system, to make eye movements that gain the most information about target location. We find that humans achieve nearly optimal search performance, even though humans integrate information poorly across fixations8,9,10. Analysis of the ideal searcher reveals that there is little benefit from perfect integration across fixations—much more important is efficient processing of information on each fixation. Apparently, evolution has exploited this fact to achieve efficient eye movement strategies with minimal neural resources devoted to memory.
Recent decades have seen considerable progress in understanding visual search11,12, eye movements1,2,13 and active robotic vision14; however, there is no formal theory of optimal eye movement strategies in conducting visual search. Such a theory would provide insight into the design requirements for effective control of eye movements and attention, and hence could serve as a powerful framework for analysing the behaviour and neurophysiology of eye movements and attention, and for developing robotic applications.
We consider the task of finding a known target that is embedded (added) at a random location in backgrounds of spatial 1/f noise, which have the same spatial power spectra as images of natural scenes7. Figure 1a shows the target (a spatial sine wave) and a sample of 1/f noise.
Not surprisingly, the optimal eye movement strategy depends critically on how the visibility (detectability) of the target varies across the retina. Thus, to specify the ideal searcher, it is necessary first to characterize the visibility maps of the visual system under consideration, for the targets and backgrounds of interest.
To characterize the visibility map for each of the conditions in the search experiment described below, detection accuracy was measured for the sine-wave target as a function of target contrast and background noise contrast, at the 25 spatial locations indicated by the small circles in Fig. 1a. The observer fixated on the centre of the display, which was monitored with an eye tracker, and for each block of trials the target was presented at only a single known location. Figure 1b shows the measurements of detection accuracy (psychometric functions) in the fovea, as a function of target contrast. Each curve is for a different root-mean-squared (r.m.s.) contrast of the noise background, where the r.m.s. contrast is defined as the standard deviation of the pixel luminance divided by the mean luminance. Each of these psychometric functions can be summarized by a contrast threshold value, the target contrast that is detected with 82% accuracy, and by a parameter that describes the steepness of the function (see Methods). Figure 1c plots the foveal contrast thresholds (in units of contrast power—the square of the contrast) for two observers. The thresholds fall approximately along a straight line, in agreement with previous studies using white-noise backgrounds15,16,17. Figure 1d plots the relative threshold (see the figure legend) as a function of noise contrast power for all the retinal locations. The relative thresholds cluster around a straight line, showing that an approximately linear relationship holds for all retinal locations tested; however, there are systematic changes in the slopes and intercepts of the lines with the distance of the target from the fovea (the retinal eccentricity), as shown in Fig. 2a, b. These slope and intercept functions (together with the steepness parameter of the psychometric function) can be used to determine the visibility map for any combination of target and background contrast. 
Figure 2c shows a cross-section of the maps for two of the conditions in the search experiment described later. Each map specifies, for every retinal eccentricity, the value of a signal-to-noise ratio, d′, which is monotonically related to detection accuracy (see Methods)3.
Now consider the ideal searcher for the relatively simple search task in which the target location is unknown but the stimulus is presented briefly so that no eye movements are possible. In this case, the optimal method is template matching (that is, cross-correlation)3,4. At each potential target location, the retinal image is multiplied by a template of the target and the product is integrated to obtain a template response. If all target locations have equal prior probability, and if the visibility map is flat, then the location with the largest template response is the most likely location of the target (that is, the location with the greatest posterior probability). To compute posterior probabilities when the visibility map is not flat, template responses are weighted by the visibility at each potential target location.
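The decision rule for this no-eye-movement case can be sketched in a few lines of code. This is an illustrative sketch, not the authors' implementation: the prior, visibilities and template responses below are invented, and the responses are weighted by the squared visibility in log space (the likelihood-ratio form that also appears in equation (1)).

```python
import math

def posterior(prior, dprime, responses):
    # Weight each template response W_i by the squared visibility d'_i,
    # add the log prior, and normalize (a numerically stable softmax).
    logits = [math.log(p) + d * d * w
              for p, d, w in zip(prior, dprime, responses)]
    m = max(logits)
    expo = [math.exp(x - m) for x in logits]
    total = sum(expo)
    return [e / total for e in expo]

# Toy example with five potential target locations and a uniform prior.
# Location 2 has a strong response and good visibility, so it should
# carry most of the posterior probability.
prior = [0.2] * 5
dprime = [3.0, 2.0, 3.0, 1.0, 0.5]
responses = [-0.5, -0.4, 0.6, -0.5, 0.3]
post = posterior(prior, dprime, responses)
best = max(range(5), key=lambda i: post[i])
```

With a flat visibility map this rule reduces to picking the largest template response; the d′² weighting matters only when visibility varies across locations.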
The temporally extended search task with eye movements introduces two new requirements for the ideal searcher: optimal integration of responses across fixations, and optimal selection of successive fixation locations. The flow diagram of processing performed by the ideal searcher is shown in Fig. 3a. During the first fixation the searcher captures responses from all potential target locations. It then computes the posterior probability that the target is located at each of these locations. If the maximum of the posterior probabilities exceeds a criterion (the criterion determines the error rate), the search stops and the location with the largest posterior probability is reported. If the criterion is not exceeded, the ideal searcher determines the fixation location that will maximize the probability of finding the target after the eye movement is made. It then moves its eyes to that location, and the process repeats.
To integrate responses across fixations optimally, the ideal searcher accumulates the visibility-weighted responses from each potential target location to obtain the posterior probability that the target is at display location i after T fixations:

p_i(T) = prior(i) exp(Σ_{t=1}^{T} d′_{ik(t)}² W_{ik(t)}) / Σ_j prior(j) exp(Σ_{t=1}^{T} d′_{jk(t)}² W_{jk(t)})   (1)

where t is the fixation number, and d′_{ik(t)} and W_{ik(t)} are the visibility and the response at display location i when the fixation is at display location k(t). Equation (1) is for the case in which both the stimulus noise and the internal noise are statistically independent in time (dynamic). Equations for the more complicated case, in which the noise is a mixture of static stimulus noise and dynamic internal noise (as in the present experiments), are given in the Supplementary Information. The predictions reported here are for the static case, although predictions for the dynamic case are similar (see Methods).
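Under the stated assumptions (template responses with mean ±0.5 and standard deviation 1/d′, independent across fixations), equation (1) amounts to accumulating d′²-weighted responses in log space. A minimal sketch with invented visibilities and responses:

```python
import math

def update_evidence(evidence, dprime_map, responses, k):
    # One fixation's contribution to equation (1): for every display
    # location i, add d'_{ik}^2 * W_{ik}, where d'_{ik} is the visibility
    # of location i while the eyes fixate location k.
    return [e + dprime_map[i][k] ** 2 * responses[i]
            for i, e in enumerate(evidence)]

def posterior(evidence, prior):
    # Normalize in log space for numerical stability.
    logits = [math.log(p) + e for p, e in zip(prior, evidence)]
    m = max(logits)
    z = sum(math.exp(x - m) for x in logits)
    return [math.exp(x - m) / z for x in logits]

# Toy display with three locations and two fixations (at locations 0 and
# 2). The response values are made up for illustration; the target is at
# location 1, so its responses are positive.
dprime_map = [[3.0, 1.0, 0.5],
              [1.0, 3.0, 1.0],
              [0.5, 1.0, 3.0]]   # dprime_map[i][k]
prior = [1 / 3] * 3
evidence = [0.0, 0.0, 0.0]
evidence = update_evidence(evidence, dprime_map, [-0.3, 0.4, -0.5], k=0)
evidence = update_evidence(evidence, dprime_map, [-0.5, 0.6, -0.2], k=2)
post = posterior(evidence, prior)
```

Note how the response at the fixated location gets weight d′² = 9 while a location two steps away gets only 0.25, so each fixation is informative mainly about its own neighbourhood.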
To compute the optimal next fixation point, k_opt(T + 1), the ideal searcher considers each possible next fixation and picks the location that, given its knowledge of the current posterior probabilities and visibility map, will maximize the probability of correctly identifying the location of the target after the fixation:

k_opt(T + 1) = arg max_{k(T+1)} Σ_i p_i(T) p(C | i, k(T + 1))   (2)

where p(C | i, k(T + 1)) is the probability of correctly identifying the target location when the target is at location i and the next fixation is at location k(T + 1). Maximizing accuracy is the relevant goal in the present task; other goals (such as minimizing entropy) have been explored for certain computer vision tasks18. Explicit expressions for equation (2) are derived in the Supplementary Information.
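The expectation in equation (2) can also be approximated by Monte Carlo: for each candidate fixation, sample a target location from the current posterior, simulate one fixation's noisy responses, and score whether the updated maximum-posterior location matches the sampled target. This sketch uses invented visibilities and is not the closed-form expression derived in the Supplementary Information:

```python
import math, random

def simulate_accuracy(prior, dprime_map, k, n_samples=2000, rng=None):
    # Monte Carlo estimate of p(correct | next fixation at k): sample a
    # target from the current posterior, simulate the responses seen from
    # fixation k (mean +/-0.5, s.d. = 1/d'), update the posterior, and
    # score whether the MAP location equals the sampled target.
    rng = rng or random.Random(0)
    n = len(prior)
    correct = 0
    for _ in range(n_samples):
        target = rng.choices(range(n), weights=prior)[0]
        logits = []
        for i in range(n):
            d = dprime_map[i][k]
            mean = 0.5 if i == target else -0.5
            w = rng.gauss(mean, 1.0 / d)
            logits.append(math.log(prior[i]) + d * d * w)
        if max(range(n), key=lambda i: logits[i]) == target:
            correct += 1
    return correct / n_samples

def best_next_fixation(prior, dprime_map):
    n = len(prior)
    accs = [simulate_accuracy(prior, dprime_map, k, rng=random.Random(k))
            for k in range(n)]
    return max(range(n), key=lambda k: accs[k]), accs

# Toy scenario: three locations; fixating a location gives d' = 4 there
# and almost no visibility (d' = 0.2) elsewhere. The posterior mass sits
# on locations 0 and 1, so fixating location 2 should be clearly worse.
prior = [0.50, 0.49, 0.01]
dmap = [[4.0 if i == k else 0.2 for k in range(3)] for i in range(3)]
k_best, accs = best_next_fixation(prior, dmap)
```

The sketch illustrates why the ideal searcher does not simply fixate the most probable location in every situation: the chosen fixation is the one that best resolves the remaining competition among likely locations.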
Given the prior probabilities of possible target locations and the visibility maps, which specify all the relevant values of d′, equations (1) and (2) can be used to simulate the behaviour of the ideal searcher (see Methods). Figure 3b shows a sequence of ideal fixations. The ideal searcher produces a rather random-looking search pattern, although it is in fact a highly principled search that reflects the specific properties of the stimulus and the visibility map. This figure also shows how the posterior probabilities across the display evolve over time. The location of the target is at the peak in the posterior probability map, visible below and to the left of centre.
The ideal searcher shows several other interesting qualitative behaviours. First, it sometimes makes fixations to the display location with the maximum posterior probability of containing the target (MAP fixations, where MAP is short for maximum a posteriori), and sometimes to a location near the centroid of a cluster of locations where the posterior probabilities are high (‘centre-of-gravity’ fixations). Both MAP and centre-of-gravity fixations have been observed in human visual search19,20,21. Second, the saccade lengths of the ideal searcher tend to be moderate in size, because posterior probabilities at nearby locations are pushed down and posterior probabilities at distant locations tend not to jump up (see Fig. 3b). Human saccade lengths also tend to be moderate in size (see later). Third, the ideal searcher tends not to fixate display locations that it has recently fixated (‘inhibition of return’), again because nearby posterior probabilities are pushed down (see Fig. 3b). Fourth, the ideal searcher sometimes makes long saccades into regions where the posterior probabilities are low, followed by a return saccade to a region with higher probabilities. It performs these eye movements because excluding an unlikely region that has not yet been inspected is sometimes the best chance for increasing the posterior probabilities in the more likely regions. It is unknown whether humans perform these ‘exclusion saccades’ in visual search, although a related type of saccade is predicted for optimal eye movements in reading22.
To compare human and ideal search quantitatively, we measured search performance for the sine-wave target randomly embedded at one of 85 locations tiling the 15° diameter display in a triangular array. Measurements were made for two levels of 1/f noise contrast (0.05 and 0.2) and for six levels of target visibility in the fovea (d′ = 3, 3.5, 4, 5, 6 and 7). The visibilities were set by using the results from the detection experiment (see Figs 1 and 2). The data points in Fig. 4a show the median number of fixations required for two human observers to find the target. (We obtained similar search performance for a third observer, naive to the aims of the study, although we do not have visibility maps for that observer.) As can be seen, search performance improves as the visibility of the target increases and is better in the high-noise condition (presumably because the visibility maps are broader; see Fig. 2c). The solid curves show the predictions of the ideal searcher with the same visibility maps as the human observers. Figure 4b shows how human and ideal search performance varies with the location of the target in the display, for all 12 stimulus conditions. The results in Fig. 4a, b imply that humans are remarkably efficient at visual search, at least under these conditions, nearly reaching the performance of the ideal searcher.
The obvious question is: How do humans perform so well? To address this question, consider the three things that the ideal searcher does in an optimal fashion: parallel detection, integration of information across fixations, and selection of the next fixation location. With regard to parallel detection, there is evidence that humans are efficient at finding targets in noise under conditions in which visibility is equal at all potential target locations4,23. Further, in brief presentations that do not allow eye movements, humans often process multiple target locations in parallel with equal efficiency11,23,24. Our results indicate strongly that humans are able to perform this kind of efficient parallel processing in complex extended search tasks.
With regard to the integration of information across fixations, there is evidence that humans are not very efficient8,9,10. So how can they approximate ideal performance? A key insight is provided by the solid curve in Fig. 4c, which plots the average posterior probability at the target location as a function of the number of fixations before the one at which the ideal searcher found the target. The rapid rise in posterior probability implies that there is little to be gained by integrating detailed display or posterior probability information more than one or two fixations into the past. However, simulations show that to achieve human performance levels it is necessary to have some coarse memory for past fixation locations so as to reduce the likelihood of returning to the same display region.
With regard to the selection of fixation locations, there is little evidence about human efficiency. To examine this issue we simulated searchers that do not select fixation locations optimally but are otherwise ideal. For example, the dashed curves in Fig. 4a show the performance of a searcher that computes posterior probabilities and integrates information across fixations optimally, but makes random fixations. Humans far outperform this random searcher. They also outperform an enhanced version with the added feature of not fixating the same location twice. The fact that humans outperform these searchers rejects all possible models of visual search where fixation locations are selected at random, with or without replacement. We have also evaluated a sub-optimal searcher that always fixates the location with the maximum posterior probability (the MAP searcher). This searcher performs almost as well as the ideal (0.5–2 fixations slower), and thus MAP fixations alone (no centre-of-gravity fixations) are sufficient to approach the efficiency of human fixation selection.
The fixation selection strategies of the searchers can also be compared by examining eye movement statistics. For example, the distributions of saccade length for the humans and the ideal searcher are similar: both are skewed to the right and reach a peak at about 3°, although the ideal distribution is more skewed than the human. The distribution for the random searcher is symmetrical and reaches a peak at about 7°. The distribution for the MAP searcher is similar to the ideal and the human, but less skewed than the human. We also examined the spatial distribution of fixation locations within the display. For both human and ideal searchers the average distribution of fixation locations across the display has a ‘doughnut’ shape that reaches a peak at about 5° from the centre. Interestingly, the MAP searcher does not have a doughnut-shaped distribution, indicating that humans are not well modelled as MAP searchers.
The present study is another example of how bayesian ideal observer analysis can provide valuable insight into sensory and perceptual processing5,6,22. The ideal searcher is, in some ways, complementary to existing computational models of visual search25,26. Unlike these previous approaches, it is not a heuristic model that can be applied to arbitrary stimuli but a formal, parameter-free analysis for a particular class of naturalistic stimuli. The ideal searcher is not meant to be a plausible model of human visual search (for example, humans do not have perfect memory), but a rigorous starting point for developing models. It remains to be determined what algorithms the brain uses to achieve near-optimal performance for our stimuli, how those algorithms are implemented in neural circuits, and how well those algorithms perform on a more general range of stimuli (for example, on natural images). Nonetheless, it is clear that humans compute something close to an accurate posterior probability map and then use that map to determine the next fixation location efficiently. They are able to reach near-optimal performance, despite relatively poor memory for visual detail and poor integration of information across fixations, because all that is actually needed is a coarse memory sufficient to support the inhibition of return.
Methods

Stimuli and apparatus
Stimuli were presented on a calibrated monochrome Image Systems monitor (M21L) with a white phosphor (P-104) and a resolution of 1024 × 768 pixels at 60 Hz. The target was a 6 cycles deg⁻¹ sine-wave grating, tilted 45° to the left and windowed by a symmetrical raised cosine (half-height width of one cycle). The background was filled with 1/f noise at a mean luminance of 40 cd m⁻²; the display outside the background was set to the mean luminance. The 1/f noise was created by filtering gaussian white noise, truncating the filtered noise waveform to ±2 s.d., and scaling to obtain the desired r.m.s. amplitude. Eye position was measured with a Fourward Technologies SRI Version 6 dual Purkinje eye tracker. Head position was maintained with a bite bar and headrest.
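The noise-generation recipe can be sketched as follows. This is a hypothetical reimplementation, not the authors' code: the patch size and contrast are arbitrary, the mean luminance is normalized to 1.0, and because the image is rescaled after truncation the clipping bounds end up at approximately (not exactly) ±2 s.d. of the final image.

```python
import numpy as np

def make_1f_noise(size=128, rms_contrast=0.2, rng=None):
    # Filter gaussian white noise so its amplitude spectrum falls as 1/f,
    # truncate to roughly +/-2 s.d., and scale to the requested r.m.s.
    # contrast (contrast relative to a mean luminance of 1.0).
    rng = rng or np.random.default_rng(0)
    white = rng.standard_normal((size, size))
    fx = np.fft.fftfreq(size)
    fy = np.fft.fftfreq(size)
    f = np.hypot(*np.meshgrid(fx, fy, indexing="ij"))
    f[0, 0] = np.inf                      # zero gain at DC
    filt = 1.0 / f                        # 1/f amplitude spectrum
    noise = np.real(np.fft.ifft2(np.fft.fft2(white) * filt))
    noise -= noise.mean()
    noise = np.clip(noise, -2 * noise.std(), 2 * noise.std())
    noise *= rms_contrast / noise.std()   # set the r.m.s. contrast
    return noise

patch = make_1f_noise()
```

The 1/f amplitude filter is what gives the backgrounds the spatial power spectrum of natural scenes (power falls as 1/f²).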
Detection experiment

On each trial, the observer fixated a dot at the centre of the display (there was no fixation dot for foveal measurements). The fixation dot disappeared at the onset of the test stimuli, which consisted of two 250-ms intervals separated by 500 ms. One interval contained background noise alone, the other a different random sample of background noise with the target added. The observer judged which interval contained the target. If the observer's fixation was not within 0.75° of the fixation dot when the test stimuli appeared, the trial was discarded. Within a session the target always appeared in the same location, which was indicated after each trial. Sixteen blocks (four target contrasts × four noise levels) of 32 trials were run in each session. The data for each noise level, in each session, were fitted with a Weibull function:

f(c) = 1 − 0.5 exp[−(c/c_T)^s]   (3)

where c is the target contrast. Both the 82%-correct threshold parameter c_T and the steepness parameter s were estimated by using maximum-likelihood methods.
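A maximum-likelihood Weibull fit can be sketched with a crude grid search (the paper does not specify its optimizer, and the data below are synthetic, generated from known parameters and then recovered):

```python
import math, random

def weibull(c, c_t, s):
    # Two-interval forced-choice accuracy: 50% at zero contrast and
    # 1 - 0.5/e (about 82%) at the threshold contrast c_t.
    return 1.0 - 0.5 * math.exp(-((c / c_t) ** s))

def neg_log_likelihood(data, c_t, s):
    # data: list of (contrast, n_correct, n_trials) triples.
    nll = 0.0
    for c, k, n in data:
        p = min(max(weibull(c, c_t, s), 1e-9), 1 - 1e-9)
        nll -= k * math.log(p) + (n - k) * math.log(1 - p)
    return nll

def fit_weibull(data):
    # Grid search over threshold and steepness; crude but sufficient
    # for an illustration.
    best = None
    for c_t in (x / 1000 for x in range(5, 200)):
        for s in (x / 10 for x in range(10, 60)):
            nll = neg_log_likelihood(data, c_t, s)
            if best is None or nll < best[0]:
                best = (nll, c_t, s)
    return best[1], best[2]

# Simulate 200 trials at each of five contrasts from a known function.
rng = random.Random(1)
true_ct, true_s = 0.05, 2.5
data = []
for c in [0.02, 0.035, 0.05, 0.07, 0.1]:
    n = 200
    k = sum(rng.random() < weibull(c, true_ct, true_s) for _ in range(n))
    data.append((c, k, n))
c_t_hat, s_hat = fit_weibull(data)
```

The recovered threshold should land close to the generating value of 0.05, illustrating how c_T and s summarize each psychometric function.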
Search experiment

On each trial, the observer first fixated a dot at the centre of the display and then initiated the trial with a button press. The fixation dot disappeared immediately and a random time later (100–500 ms) the search display appeared. The observer's task was to find the target as rapidly as possible and to press a button as soon as the target was located. The observer then indicated the judged target location by fixating that location and pressing the button again. The response was considered correct if the eye position at the time of the second button press was closer to the target location than to any other potential target location. Each session consisted of 6 blocks of 32 search trials. In each session the background noise contrast was fixed. Before each block, the eye tracker was calibrated and the search target for that block was shown on a uniform background at the centre of the display. The set of 12 conditions was repeated 6 times in a counterbalanced fashion.
Estimating the visibility maps

The solid curves in Figs 1c, 1d, 2a and 2b describe how the threshold parameter c_T(e_n, ɛ) varied with noise contrast power e_n and eccentricity ɛ. The steepness parameter of the psychometric functions varied with eccentricity, from a value of 2 in the fovea to more than 4 in the periphery, and was well approximated by s(ɛ) = 2.8ɛ/(ɛ + 0.8) + 2. These descriptive models were substituted into equation (3) to obtain a formula for detection accuracy, f(c, e_n, ɛ). The visibility map was obtained by taking the inverse standard normal integral of the accuracy: d′(c, e_n, ɛ) = √2 Φ⁻¹(f(c, e_n, ɛ)). The factor of √2 takes into account that there were two intervals in the forced-choice detection task, but (effectively) only one interval in each fixation of the search task3. The visibility maps were obtained by averaging across all directions; in fact, the visibility maps are not radially symmetric but fall off rather faster in the vertical direction than in the horizontal direction. Using the radial average has no effect on the predictions described here.
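The conversion from thresholds to a visibility map might look as follows. The steepness formula and the √2 conversion are taken from the text, but the threshold values used here are hypothetical placeholders (the measured threshold functions live in Fig. 2a, b):

```python
import math
from statistics import NormalDist

def accuracy(c, c_t, s):
    # Weibull psychometric function from the detection experiment.
    return 1.0 - 0.5 * math.exp(-((c / c_t) ** s))

def steepness(ecc):
    # Eccentricity dependence of the steepness parameter, from the text:
    # s(ecc) = 2.8*ecc/(ecc + 0.8) + 2.
    return 2.8 * ecc / (ecc + 0.8) + 2.0

def visibility(c, c_t, ecc):
    # d' = sqrt(2) * Phi^{-1}(accuracy): converts two-interval
    # forced-choice accuracy into a single-fixation signal-to-noise ratio.
    return math.sqrt(2) * NormalDist().inv_cdf(accuracy(c, c_t, steepness(ecc)))

# Hypothetical contrast thresholds rising with eccentricity (in degrees).
thresholds = {0.0: 0.04, 4.0: 0.08, 8.0: 0.16}
dmap = {ecc: visibility(0.1, c_t, ecc) for ecc, c_t in thresholds.items()}
```

Because thresholds rise with eccentricity while target contrast is fixed, d′ falls off away from the fovea, which is exactly the shape of the cross-sections in Fig. 2c.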
Searcher simulations

Here we describe the simulations for the case in which the external and internal noise is statistically independent (dynamic) in time; the Supplementary Information describes the case in which the external noise is static. We note first that the template responses at each location can be given an expected value of 0.5 when the target is at that location, and -0.5 otherwise. This has no effect on the predictions as long as we add enough noise to the simulated template responses to make the values of d′ exactly those given by the visibility map. The six steps for each simulated search trial were as follows. First, fixation began at the centre of the display (as in the search experiment). Second, a target location was selected at random (prior(i) = 1/85); 0.5 was added to that location and -0.5 to all other locations. Third, a gaussian noise sample was generated for each of the 85 potential target locations. The standard deviation of each noise sample was set to be consistent with the value of the visibility map at that location: σ = 1/d′. Fourth, the posterior probability for each potential target location was calculated from equation (1). Fifth, if the maximum posterior probability exceeded a criterion, the search stopped. The criterion for each condition was picked so that the error rate of the ideal searcher approximated that of the humans. Sixth, for the ideal searcher, equation (2) was evaluated to select the next fixation location; for the random searcher, the next fixation location was selected at random. The process then jumped back to the third step. Note that the specific characteristics of the target and 1/f noise enter the simulation through the visibility maps. Also note that although 1/f noise is spatially correlated, we verified by simulation that the template responses were effectively uncorrelated.
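The six steps above can be sketched as a single trial loop. This sketch substitutes the simpler MAP fixation rule for the full equation-(2) selection (the text notes the MAP searcher is nearly as efficient), and the one-dimensional "display" geometry and visibility fall-off are invented:

```python
import math, random

def simulate_search(dprime_map, criterion=0.99, max_fixations=200, rng=None):
    # One simulated search trial following the six steps in the text,
    # with the MAP rule (fixate the currently most probable location)
    # standing in for the full equation-(2) optimization.
    rng = rng or random.Random(0)
    n = len(dprime_map)
    target = rng.randrange(n)                     # step 2: random target
    log_post = [math.log(1.0 / n)] * n            # uniform prior
    k = n // 2                                    # step 1: start at centre
    for t in range(1, max_fixations + 1):
        for i in range(n):                        # step 3: noisy responses
            d = dprime_map[i][k]
            mean = 0.5 if i == target else -0.5
            w = rng.gauss(mean, 1.0 / d)          # noise s.d. = 1/d'
            log_post[i] += d * d * w              # step 4: equation (1)
        m = max(log_post)
        z = sum(math.exp(x - m) for x in log_post)
        post = [math.exp(x - m) / z for x in log_post]
        if max(post) > criterion:                 # step 5: stopping rule
            best = max(range(n), key=lambda i: post[i])
            return best == target, t
        log_post = [math.log(max(p, 1e-300)) for p in post]
        k = max(range(n), key=lambda i: post[i])  # step 6: MAP fixation

    return False, max_fixations

# Toy 1-D "display": 15 locations, with visibility falling off with the
# distance between a location and the fixation point.
dmap = [[3.0 / (1 + abs(i - k)) for k in range(15)] for i in range(15)]
rng = random.Random(7)
results = [simulate_search(dmap, rng=rng) for _ in range(40)]
accuracy = sum(ok for ok, _ in results) / len(results)
mean_fixations = sum(t for _, t in results) / len(results)
```

Even this stripped-down searcher reproduces the qualitative behaviours described earlier: fixated non-target locations are strongly suppressed (inhibition of return), and the trial ends when the posterior at one location exceeds the criterion.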
The ideal searcher performs slightly better in the dynamic case than in the static case, because it gains more information on each fixation, but the predicted curves are nearly parallel.
References

1. Carpenter, R. H. S. (ed.) Eye Movements (Macmillan, London, 1991)
2. Liversedge, S. P. & Findlay, J. M. Saccadic eye movements and cognition. Trends Cogn. Sci. 4, 6–14 (2000)
3. Green, D. M. & Swets, J. A. Signal Detection Theory and Psychophysics (Wiley, New York, 1966)
4. Burgess, A. E. & Ghandeharian, H. Visual signal detection. II. Effect of signal-location identification. J. Opt. Soc. Am. A 1, 906–910 (1984)
5. Geisler, W. S. & Diehl, R. L. A Bayesian approach to the evolution of perceptual and cognitive systems. Cogn. Sci. 27, 379–402 (2003)
6. Kersten, D., Mamassian, P. & Yuille, A. L. Object perception as Bayesian inference. Annu. Rev. Psychol. 55, 271–304 (2004)
7. Field, D. J. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A 4, 2379–2394 (1987)
8. Irwin, D. E. Information integration across saccadic eye movements. Cognit. Psychol. 23, 420–458 (1991)
9. Hayhoe, M. M., Bensinger, D. G. & Ballard, D. H. Task constraints in visual working memory. Vision Res. 38, 125–137 (1998)
10. Rensink, R. A. Change detection. Annu. Rev. Psychol. 53, 245–277 (2002)
11. Palmer, J., Verghese, P. & Pavel, M. The psychophysics of visual search. Vision Res. 40, 1227–1268 (2000)
12. Wolfe, J. M. in Attention (ed. Pashler, H.) 13–74 (Psychology Press, Hove, East Sussex, 1998)
13. Schall, J. D. in The Visual Neurosciences (eds Chalupa, L. M. & Werner, J. S.) 1369–1390 (MIT Press, Cambridge, Massachusetts, 2004)
14. Blake, A. & Yuille, A. L. (eds) Active Vision (MIT Press, Cambridge, Massachusetts, 1992)
15. Burgess, A. E., Wagner, R. F., Jennings, R. J. & Barlow, H. B. Efficiency of human visual signal discrimination. Science 214, 93–94 (1981)
16. Pelli, D. G. & Farell, B. Why use noise? J. Opt. Soc. Am. A 16, 647–653 (1999)
17. Lu, Z.-L. & Dosher, B. A. Characterizing human perceptual inefficiencies with equivalent internal noise. J. Opt. Soc. Am. A 16, 764–778 (1999)
18. Geman, D. & Jedynak, B. An active testing model for tracking roads in satellite images. IEEE Trans. Pattern Anal. Mach. Intell. 18, 1–14 (1996)
19. Rajashekar, U., Cormack, L. K. & Bovik, A. C. in Eye Tracking Research & Applications (ed. Duchowski, A. T.) 119–123 (ACM SIGGRAPH, New Orleans, 2002)
20. Findlay, J. M. Global processing for saccadic eye movements. Vision Res. 22, 1033–1045 (1982)
21. Zelinsky, G. J., Rao, R. P., Hayhoe, M. M. & Ballard, D. H. Eye movements reveal the spatio-temporal dynamics of visual search. Psychol. Sci. 8, 448–453 (1997)
22. Legge, G. E., Hooven, T. A., Klitz, T. S., Mansfield, J. G. & Tjan, B. S. Mr. Chips 2002: new insights from an ideal-observer model of reading. Vision Res. 42, 2219–2234 (2002)
23. Eckstein, M. P., Beutter, B. R. & Stone, L. S. Quantifying the performance limits of human saccadic targeting during visual search. Perception 30, 1389–1401 (2001)
24. Eckstein, M. P., Thomas, J. P., Palmer, J. & Shimozaki, S. S. A signal detection model predicts the effects of set size on visual search accuracy for feature, conjunction, triple conjunction, and disjunction displays. Percept. Psychophys. 62, 425–451 (2000)
25. Itti, L. & Koch, C. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Res. 40, 11–46 (2000)
26. Rao, R. P. N., Zelinsky, G. J., Hayhoe, M. M. & Ballard, D. H. Eye movements in iconic visual search. Vision Res. 42, 1447–1463 (2002)
We thank R.F. Murray for helpful discussions, and J. Perry, L. Stern and C. Creeger for technical assistance. This work was supported by the National Eye Institute, NIH.
The authors declare that they have no competing financial interests.
Supplementary information

This document contains a derivation of the ideal searcher for the case of dynamic (temporally uncorrelated) external and internal noise, and describes the ideal searcher for the case of static (temporally correlated) external noise and dynamic (temporally uncorrelated) internal noise.
Najemnik, J., Geisler, W. Optimal eye movement strategies in visual search. Nature 434, 387–391 (2005). https://doi.org/10.1038/nature03390