Optimal planning of eye movements

The capability of directing gaze to relevant parts in the environment is crucial for our survival. Computational models based on ideal-observer theory have provided quantitative accounts of human gaze selection in a range of visual search tasks. According to these models, gaze is directed to the position in a visual scene, at which uncertainty about task relevant properties will be reduced maximally with the next look. However, in tasks going beyond a single action, delayed rewards can play a crucial role thereby necessitating planning. Here we investigate whether humans are capable of planning more than the next single eye movement. We found evidence that our subjects’ behavior was better explained by an ideal planner compared to the ideal observer. In particular, the location of the first fixation differed depending on the stimulus and the time available for the search. Overall, our results are the first evidence that our visual system is capable of planning.

[...] tasks, which rely on acquiring visual information for survival such as gathering food, avoiding predators, making tools, and social interaction. As we can only perceive a small proportion of our surroundings at any moment in time due to the spatial distribution of our retinal receptor cells [1], we are constantly forced to bring task-relevant parts of the visual scene into focus using eye movements [2]. Thus, vision is a sequential process of active decisions. These decisions have been characterized in terms of optimizing performance in the ongoing task [3-7], maximizing knowledge about the environment [8-10], or targeting gaze towards locations that are most salient [11].

To understand the requirements of perceptual tasks, ideal-observer analysis [12,13] has been very successful, based on the idea that visual perception is inference of latent causes based on sensory signals [14,15]. In this framework, the goal of the visual system is to use sensory data D to infer unknown properties of the state s of the environment. For example, s could indicate whether there is a predator hiding behind a bush, and by directing gaze to the bush, visual data D about the latent variable describing the true state s of the environment is obtained. This information can be incorporated into what is known about s using Bayes' theorem P(s|D) = P(D|s)P(s)/P(D). Hence, the ideal observer combines prior knowledge P(s) and sensory information P(D|s) to form an updated posterior belief about environmental states relevant to the specific task. The ideal-observer paradigm has been used successfully to understand how humans choose locations for the next saccade. Specifically, human eye movements appear to use the current posterior and target the location where uncertainty about task-relevant variables is expected to be reduced most after having acquired new data from that location, in situations such as visual search [3], face recognition [5], and temporal event detection [6].
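The Bayesian update described above can be illustrated with a minimal sketch for a discrete state. The function name `posterior` and all numbers (prior and likelihood values for the predator example) are hypothetical and chosen only for illustration.

```python
def posterior(prior, likelihood):
    """Return P(s|D) for discrete states, given P(s) and P(D|s)."""
    # Numerator of Bayes' theorem for every state s: P(D|s) * P(s)
    unnormalized = {s: likelihood[s] * prior[s] for s in prior}
    # Evidence P(D) is the sum of the numerators over all states
    evidence = sum(unnormalized.values())
    return {s: p / evidence for s, p in unnormalized.items()}


# Hypothetical numbers: prior belief about a predator behind the bush and
# the probability of the observed visual data D under each state.
prior = {"predator": 0.1, "no_predator": 0.9}
likelihood = {"predator": 0.8, "no_predator": 0.2}
belief = posterior(prior, likelihood)
```

In this toy case the observation raises the belief in a predator from 0.1 to roughly 0.31, which is still far from certainty.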

A limitation of ideal-observer theory is that performing sensory inference by itself does not prescribe an action, i.e. information about s in the end needs to be used to decide for an appropriate action, e.g. whether to flee. The costs and benefits for the potential outcomes of the action can be very different, e.g., not to flee if a predator is present is more costly than an unnecessary flight. Bayesian decision theory provides such an answer by using the costs and benefits of different outcomes with the respective uncertainties of the associated outcomes. Hence, different potential outcomes of s are weighted with a utility function U(a, s) to determine the action with highest expected utility: $a = \operatorname*{argmax}_a \int_s U(a, s) P(s|D)\, ds$. Thus, it may be better to flee, even when one is not absolutely certain that a predator is hiding behind a bush, because the consequences may be particularly harmful. Interestingly, within this framework, the optimal action targets the location where the next fixation will reduce uncertainty the most and not the location that currently looks like the most probable target location. Indeed, both explicit monetary rewards [16] and implicit behavioral costs [6] in experimental settings have been shown to influence eye movement choices.

However, Bayesian decision theory is limited to a particular subset of visual tasks, namely tasks that do not involve planning. Repeatedly taking the action with the maximum immediate utility in general may fail in tasks with longer action sequences and delayed rewards depending on the specific task structure. In these cases, an ideal planner based on the more powerful framework of belief MDPs, which contains the ideal observer and the Bayesian decision maker as special cases, is needed to find the optimal strategy.
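The expected-utility rule of Bayesian decision theory described above can be sketched for the flee-or-stay example. The helper `best_action`, the belief, and the utility values are hypothetical; the integral over s becomes a sum because the state is discrete here.

```python
def best_action(belief, utility):
    """Return the action maximizing sum_s U(a, s) * P(s|D), plus all values."""
    expected = {
        action: sum(u_given_state[s] * belief[s] for s in belief)
        for action, u_given_state in utility.items()
    }
    return max(expected, key=expected.get), expected


# Hypothetical posterior belief and utilities U(a, s).
belief = {"predator": 0.3, "no_predator": 0.7}
utility = {
    "flee": {"predator": 0.0, "no_predator": -1.0},    # small cost of needless flight
    "stay": {"predator": -100.0, "no_predator": 0.0},  # large cost of being caught
}
action, expected = best_action(belief, utility)
```

Although a predator is the less probable state under this belief, the asymmetric utilities make fleeing the action with the higher expected utility.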
A Markov Decision Process (MDP) [17,18] is a tuple (S, A, T, R, γ), where S is a set of states, A is a set of actions, T = P(s'|s, a) contains the probabilities of transitioning from one state to another, R represents the reward, and finally, γ denotes the discount factor. In a belief MDP only partial information about the current state s is available, therefore a probability distribution over states is kept as a belief state b(s) = P(s | D) [19]. The value of taking an action a in a belief state b(s) is denoted by the action-value function Q(b(s), a), where V*(b(s')) is the expected future reward gained from the next belief state b(s'). Essentially, what this means is that the value of an action based on the current belief is a combination of the immediate reward and the long-term expected reward, weighted by how likely the next belief is under the action. Thus, as the belief about the state of task-relevant quantities depends on uncertain observations, actions are influenced both by obtaining rewards and obtaining more evidence about the state of the environment.

Figure 1. [...] Targets were only visible in close proximity to the current fixation location (i.e., inside the search area). (b) Procedure for a single trial. Subjects fixated a fixation cross shown either on the left or the right side. The shape appeared 750 ms prior to the start of the search. The search time was initiated by the participants' gaze crossing the dotted line. The line, however, was not visible to the subjects. Depending on the condition (short or long), subjects were able to perform one or two fixations inside the shape. (c) Raw gaze data are shown for a trial with short search time and initial fixation on the right side (upper panel) and for a trial with long search time and initial fixation on the left side (lower panel). Shapes were mirrored in a counterbalanced design to ensure equal orientation with respect to the initial fixation cross.
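The verbal description of the action-value function corresponds to the standard Bellman form for belief MDPs. The expression below is a sketch in that textbook form and not necessarily the authors' exact notation; in particular, writing the immediate reward as R(b(s), a) and the belief transition as P(b(s') | b(s), a) is an assumption.

```latex
Q\bigl(b(s), a\bigr) = R\bigl(b(s), a\bigr)
  + \gamma \sum_{b(s')} P\bigl(b(s') \mid b(s), a\bigr)\, V^{*}\bigl(b(s')\bigr),
\qquad
V^{*}\bigl(b(s)\bigr) = \max_{a} Q\bigl(b(s), a\bigr)
```

Setting γ = 0 removes the second term, so that only the immediate reward drives action selection.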

In the present study we devised a task that allows probing whether ideal-observer models are sufficient to describe human eye movement strategies. For our visual search task, we derived computational models based on ideal-observer theory as well as on the framework of belief MDPs. Using these models, we specifically created our stimuli such that the two models led either to different behavioral sequences or to the same behavioral sequences. The rationale for this was to not only show the differences between the ideal planner and the ideal observer but to also demonstrate that the solutions of both may lead to the same action sequence, depending on the structure of the specific task. Using this experimental paradigm we are able to test whether human eye movement strategies follow the computational principles underlying ideal-observer theory and sequential Bayesian decision making or whether the strategies are planned and future rewards need to be considered (belief MDP).

Visual search as planning under uncertainty. To develop a computational model of visual search as optimal planning under uncertainty it is first necessary to specify the relevant quantities describing the task, i.e. the state representation. In our visual search task (Fig. 1), a suitable candidate for a state representation is the target location and the current location of gaze. However, in general, the exact location of the target is unknown. Therefore, we formalize the probability distribution of the target as a belief state. The action space comprises potential fixation locations and with each action we receive information about the target, update our belief and transition to the next belief state. The reward function is an intuitive mapping between the belief state, which comprises the knowledge about the location of a potential target, and the probability of finding the target.

How should the actor decide where to look next according to this framework? A policy π is a sequence of actions and the optimal policy π* comprises actions $a = \operatorname*{argmax}_a Q(b(s), a)$ that maximize the expected reward. In tasks comprising sequences of actions, the optimal strategy, the ideal planner, incorporates rewards associated with future actions (V*(b(s'))) into action selection. As a result, the sequence of actions that leads to the maximum total reward is chosen:

$$\pi^*_{\text{ideal planner}} = \operatorname*{argmax}_{a_0, a_1, \ldots, a_n} E\left[r_0 + \gamma r_1 + \cdots + \gamma^n r_n\right], \qquad (2)$$

where γ is the discount factor, which controls how much future rewards influence the current action selection.

Ideal observer as special case of the ideal planner. If we are only interested in the optimal next action (γ = 0) or if there is only a single action to perform, equation (1) simplifies to:

$$a = \operatorname*{argmax}_a \int_s R(s, a)\, b(s)\, ds, \qquad (3)$$

where b(s) = P(s | D) is the posterior over relevant quantities in the task and R(s, a) is the cost or reward function. Therefore, if reduced to the next action alone, the ideal planner reduces to the ideal observer with an action selected to maximize task success after the next action. For sequences of actions, the sequential application of the ideal-observer paradigm leads to the action sequence:

$$\pi_{\text{ideal observer}} = (a_0, \ldots, a_n), \quad a_t = \operatorname*{argmax}_{a} E[r_t], \qquad (4)$$

where $a_0, \ldots, a_n$ is the sequence of actions that yields the maximum expected return $r_t$ for each time step t. Whether $\pi_{\text{ideal observer}}$ and $\pi^*_{\text{ideal planner}}$ lead to the same action sequence depends on the specific nature of the task. However, in general:

$$\pi_{\text{ideal observer}} \neq \pi^*_{\text{ideal planner}},$$

as can be seen in Fig. 2. Ideal-observer approaches only lead to optimal actions if future rewards do not play a role, for example, if only a single action is concerned.

Figure 2. [...] For the long search interval (right side, two fixations), the ideal observer and the ideal planner differ with respect to the scanpath. While the ideal observer's next fixation is chosen to maximize the immediate reward (better performance after the first fixation, bottom row), the ideal planner's scanpath is chosen to maximize performance after two fixations. Computational complexity (depicted as decision trees) is higher for the ideal planner, as in the condition with long search intervals all two-fixation sequences are evaluated in order to maximize performance. (b) Shapes used in our visual search experiment. For each shape the optimal policy is shown for the ideal observer (pink) and the ideal planner (green). Whether these models lead to different strategies depends on the particular shape. Scanpaths are the same for Shapes S1 and S3, but differ for S2 and S4.

Figure 3. [...] The scanpaths suggested by the best fitting models for the ideal planner and the ideal observer are shown in the center and the right column, respectively. Again, solid lines depict the strategy for the long search interval, dashed lines for the short search interval. Global means of the human data are also shown for reference (red, green, and blue). (b) Actual and predicted spatial relation of first saccades for all four shapes. Graphs are centered at the fixation location in the short search interval condition. Arrows depict the displacement of the first fixation location in the long search interval relative to the short interval. Arrow color corresponds to the data source. For the ideal observer, the first fixation location is the same for both conditions (indicated by the square centered at (0,0)). (c) Difference in BIC between all tested models. The lower bound corresponds to a model directly estimating the mean fixation locations for each shape and condition from the data (3 × 4 means).

Surprisingly, all of the reviewed computational models for eye movements are myopic, i.e. they choose actions that maximize the immediate reward [3,20,16,5-7]. In practice, the problem of delayed rewards is circumvented by either investigating only single saccades or by choosing tasks where both policies lead to equivalent solutions. To our knowledge, there exist neither computational models nor empirical data investigating whether humans are capable of planning eye movements. The execution of eye movement sequences has been subject to psychological research, and results have shown that the latency of the first saccade was higher for longer sequences of saccades [21]. Also, discrimination performance was enhanced at multiple locations within an instructed sequence of saccades [22]. Further, if an eye movement plan was interrupted by additional information midway, the execution of the second saccade was delayed [23]. Although these results indicate that a scanpath of at least two saccades is internally prepared before execution, no light is shed on whether multiple future fixation locations are jointly chosen to maximize performance in a task.

Behavioral and model results. The mean fixation location for each participant, separately for all shapes and conditions, is shown in Fig. 3a. Also, fixation sequences for the best fit of the ideal observer (right column) and the ideal planner (center column) are depicted. Visual inspection suggests that the behavioral data more closely resemble the results of the ideal planner. To test whether eye movements were planned, we compared the first fixation location in the short condition to the first fixation location in the long condition for all shapes. If subjects were capable of planning, we expected a difference in the first fixation location for Shapes S2 and S4. We used Hotelling's T² test to compare the bivariate landing positions of the first saccade between the two search intervals (Supplementary Table 1). Indeed, mean target locations for the first saccade were different in Shapes S2 and S4. No significant differences, however, were found in Shapes S1 and S3. This behavior was well predicted by our ideal planner, but not by the ideal observer. In addition, the direction of the spatial difference of the first fixation location between the search interval conditions followed the course suggested by our ideal planner (Fig. 3b).

Bounded actor extensions. We extended both the ideal observer and the ideal planner to yield a more realistic model for human visual search behavior, i.e. a bounded actor (see Materials). We added additive costs for longer saccade amplitude (as they lead to longer scanpath duration [24] and higher endpoint variability [25], which humans have been shown to minimize [26]), used foveated versions of the shapes to account for the decline of visual acuity in peripheral vision [27], and accounted for the often reported fact that human saccades undershoot their target [28,29].
We used the sum of squared errors between our model prediction and our data to compute the BIC for each model. Figure 3c shows the difference in BIC of all models compared to the best model. The lower bound was derived by computing the mean fixation locations directly from the data (3 × 4 parameters). The difference in BIC values between two models is an approximation for the log Bayes factor and a difference ΔBIC > 4.6 is considered to be decisive [30]. Results clearly favor the ideal planner over the ideal observer (ΔBIC = 138). Crucially, the ideal planner without any parameter fitting still provided a better description of our human data than the ideal observer with all extensions (ΔBIC = 27). Further, all model extensions not only improved the model fit for the ideal planner but were also favored by model selection, suggesting that they are needed for describing the eye movement data in our experiment (ΔBIC = 11 between the ideal planner with all extensions and the ideal planner without undershoot).
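A BIC computed from a sum of squared errors can be sketched as follows. The text does not give the exact expression used, so the common Gaussian-error form n·ln(SSE/n) + k·ln(n) is assumed here, and the function name and all numbers are hypothetical.

```python
import math


def bic_from_sse(sse, n_obs, n_params):
    """BIC under i.i.d. Gaussian errors, up to an additive constant.

    Assumed form: n * ln(SSE / n) + k * ln(n). The constant cancels when
    two models fitted to the same data are compared.
    """
    return n_obs * math.log(sse / n_obs) + n_params * math.log(n_obs)


# Hypothetical example: model A fits better but uses more parameters.
bic_a = bic_from_sse(sse=120.0, n_obs=200, n_params=3)
bic_b = bic_from_sse(sse=150.0, n_obs=200, n_params=1)
delta_bic = bic_b - bic_a  # positive values favor model A
```

In this made-up comparison the improvement in fit outweighs the penalty for two additional parameters, so the difference exceeds the 4.6 threshold mentioned in the text.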

Parameter estimates for the saccadic undershoot were similar for the ideal observer (4.14 %) and the ideal planner (5.07 %). The influence of the costs for longer saccades was higher for the ideal observer (1.2 DP/Deg) compared to the ideal planner (0.55 DP/Deg). The unit of the costs is detection performance (DP) per degree (Deg) and states how much performance subjects were willing to give up to shorten saccade amplitudes by one visual degree. We also estimated the radius of the circular gaze-contingent search shape centered at the current fixation. Parameter estimation yielded values very close to the true radius and did not improve model quality for either the ideal planner or the ideal observer.
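The meaning of the two fitted parameters can be illustrated with a small sketch. It assumes that detection performance is expressed in percentage points, that the cost is subtracted linearly with amplitude, and that undershoot shortens the saccade vector by a fixed fraction; these choices, the function names, and the candidate values are hypothetical and do not reproduce the fitted model.

```python
import math

COST_PER_DEG = 0.55   # ideal-planner estimate from the text (DP/Deg)
UNDERSHOOT = 0.0507   # ideal-planner estimate from the text (5.07 %)


def landing_point(start, target, undershoot=UNDERSHOOT):
    """Assumed undershoot model: the saccade covers (1 - undershoot) of the vector."""
    gain = 1.0 - undershoot
    return (start[0] + gain * (target[0] - start[0]),
            start[1] + gain * (target[1] - start[1]))


def net_value(detection_performance, start, target, cost_per_deg=COST_PER_DEG):
    """Detection performance minus an additive cost per degree of amplitude."""
    amplitude = math.dist(start, target)
    return detection_performance - cost_per_deg * amplitude


# Hypothetical candidates: a far location with slightly higher performance
# versus a near location with slightly lower performance.
start = (0.0, 0.0)
far_value = net_value(60.0, start, (10.0, 0.0))
near_value = net_value(57.0, start, (4.0, 0.0))
```

Without the cost term the far location would be preferred; with it, the nearer location has the higher net value in this toy case.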

It has been unclear whether sequences of human eye movements are planned ahead in time. Prior studies indicate that multiple saccadic targets are jointly prepared as a scanpath and that cueing new targets during execution of eye movements results in longer execution times [21-23]. However, to our knowledge there has been no experimental evidence that eye movements are chosen by considering more than one step ahead into the future. Instead, the ideal-observer paradigm, which models human eye movements as sequential Bayesian decisions, has been the predominant approach.

In our study we tested whether the implicit assumptions that accompany the ideal observer are justified. Therefore, we contrasted the ideal observer with the more general ideal planner that was formalized as a Markov Decision Process [18] with partially observable states [19]. We formalized policies for the ideal observer, only considering the immediate reward for action selection, and for the ideal planner, which also considers future rewards. Next, we derived the specific circumstances under which the models produce different policies. Ultimately, we used these insights to construct stimuli that maximized the behavioral differences elicited by the different cognitive strategies and also obtained stimuli that show very similar strategies. Thus, the resulting stimuli were highly suitable for examining which cognitive strategy was adopted by our subjects.

We developed a visual search task where we expected different behavioral sequences depending on the cognitive strategy of our subjects. In particular, we investigated whether subjects adjust their scanpath during visual search dependent on the duration of the search interval. Therefore, we controlled the length of the saccadic sequence. The short search interval allowed subjects to execute a single saccade, while in the long search interval subjects were able to fixate two locations.

Our results suggest that eye movements are indeed planned. Subjects' scanpaths were very well predicted by the ideal planner while showing severe deviations from the scanpath proposed by the ideal observer. Crucially, this was the case even if the sequence required planning. We found fixation locations to be different depending on the duration of the search interval. This difference is only expected under the ideal planner and cannot be explained by the ideal observer. Finally, model comparison favored the ideal planner and its extensions over the ideal observer by a large margin. Furthermore, extending our ideal planner model to a bounded planner, we found evidence that subjects traded off task performance and saccade amplitude. Including additive costs for saccades with large amplitude in the ideal planner and accounting for saccadic undershoot explained our data best.

Finding and executing near optimal gaze sequences is crucial for many extended sequential every-day tasks [31,32]. The capability of humans to plan behavioral sequences gives further insights into why we can solve so many tasks with ease, which are extremely difficult from a computational perspective. In many visuomotor tasks coordinated action sequences are needed rather than single isolated actions [33]. This leads to delayed rewards and thus a complex policy is required rather than an action that directly maximizes the performance after the next single gaze switch. Additionally, our findings have implications for future models of human eye movements. While numerous influential past models have not taken planning into consideration [3,5,6,20], our results indicate that in the case of visual search humans are capable of including future states into the selection of a suitable scanpath.

The broader significance of the present results beyond the understanding of eye movements lies in the fact that human behavior in our experiment was best described by a computational model of a bounded probabilistic algorithm for planning under perceptual uncertainty. In this framework, sensory measurements and goal-directed actions are inseparably intertwined [34,35]. So far, the predominant approach to probabilistic models in perception has been the ideal observer [12,13], which can be formalized in the Bayesian framework [14,15] as inferring latent causes in the environment giving rise to sensory observations. Models of eye movement selection have so far used ideal observers [3,5,6] without planning. Probabilistic, Bayesian formulations of optimality in perceptual tasks [36,37], cognitive tasks [38,39], reasoning [40], motor control [41], learning [42], and planning [43] have led to a better understanding of human behavior and the quest to unravel how the brain could implement these computations [44-46], which are known in general to be intractable [47] [...]

Task. In our task subjects searched for a hidden target within irregularly bounded shapes (Fig. 1a). Using a gaze-contingent paradigm, the hidden target only became visible if a fixation landed close enough ($\lVert p_{\mathrm{Fix}} - p_{\mathrm{Tar}} \rVert < 6.5°$). The search area was made explicit by showing the shape's texture for all points closer than 6.5° to the fixation location. Targets within that area became visible to the participant after a delay of 130 ms. This was done to prevent participants from sliding over the image and instead encourage them to perform distinct fixations. Texture was chosen to reinforce the feeling of looking through the shape (subjects were told to imagine wearing x-ray goggles).
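The gaze-contingent visibility rule of the task can be sketched as a small predicate. The radius and delay are taken from the text; the function name and the interpretation of the delay as the time the target has spent inside the search area are assumptions.

```python
import math

SEARCH_RADIUS_DEG = 6.5   # search radius stated in the text
REVEAL_DELAY_MS = 130     # reveal delay stated in the text


def target_visible(fixation, target, dwell_ms):
    """True if the target lies inside the search area long enough to be shown.

    `fixation` and `target` are (x, y) positions in degrees of visual angle;
    `dwell_ms` is the assumed time the target has been inside the search area.
    """
    inside = math.dist(fixation, target) < SEARCH_RADIUS_DEG
    return inside and dwell_ms >= REVEAL_DELAY_MS
```

A brief pass of gaze over the target location, for example `target_visible((0, 0), (3, 4), 100)`, therefore does not reveal the target, which matches the stated purpose of discouraging sliding over the image.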

A single trial was as follows (Fig. 1b): Participants fixated a fixation cross that was randomly presented either on the left or the right side of the screen. After 1 s the shape was [...]

Materials. Our computational models enabled us to specifically select shapes that facilitate testing our hypothesis. In particular, we identified stimuli that triggered different policies for the ideal-observer model and the ideal-planner model. First, multiple candidate shapes were generated using the following approach: Five points were drawn uniformly in a bounded area (23.24° × 23.24°). Next, a B-spline was fitted to the random points. Finally, the shapes bounded by the splines using the fitted parameters were filled with a texture (white noise). We applied both models to identify shapes that lead to different policies. Overall, four different shapes were used in the experiment (see Fig. 2b). We chose two shapes where optimal behavior requires planning (S2 and S4) and two where it does not (S1 and S3), i.e. where the sequences of eye movements from the ideal observer and the ideal planner coincide. In each category we selected two shapes by visual inspection, ensuring that they were similar with respect to the area covered. For display during the experiment the shapes were upscaled by a factor of 1.5 and centered on the monitor such that the center of the shape's bounding box matched the center of the screen.

The target was a circular grating stimulus (0.87° in diameter). Contrast was set such that the target was easily detected if it was within the visible search radius of the current fixation.

The target's position was generated by randomly choosing a location within the shape. [...]

If necessary the search time was adjusted (between 500 ms and 580 ms, for the long search interval). Participants were encouraged to ask questions if anything was unclear. After training, participants answered ten questions from a checklist to ensure that they understood the task properly (e.g., when does the search interval start and how many targets can be found at most). Incorrect answers were documented and the correct answers were discussed. After successfully finishing the training, four blocks each containing 100 trials were performed. Thereby, the order of the blocks was either SSLL (two blocks with short search time followed by two blocks with long search time) or LLSS. Participants were randomly assigned to one of the two orders. Eye tracking calibration was renewed before each block. [...]

Model. Here we derive expressions that implement the general mechanisms of equations (2) and (4) for our visual search task. According to our experimental design participants directed their gaze to suitable locations within a shape in order to decide if a target was present.

Depending on the condition, the action sequence in our task comprised one (short condition) or two (long condition) fixation locations. Formally, the greedy policy of the ideal observer (equation (4)) leads to the sequence of fixation locations $(x_0, y_0), (x_1, y_1), \ldots, (x_n, y_n)$ that maximizes the quality of the decision after each step. In the case of two fixations this leads to:

$$\pi_{\text{ideal observer}} := \left( \operatorname*{argmax}_{(x_0, y_0)} P(\text{correct} \mid x_0, y_0),\ \operatorname*{argmax}_{(x_1, y_1)} P(\text{correct} \mid x_1, y_1) \right),$$

where $x_n, y_n$ are the coordinates of the nth fixation location and $P(\text{correct} \mid x_n, y_n)$ denotes the probability of deciding correctly whether a target is present after the nth fixation.

The non-greedy policy of the ideal planner can be derived from equation (2) in a similar fashion. Again, we consider the case of two fixations (long interval, LI). Here, the next fixation location is determined by maximizing the reward simultaneously using the next two fixation locations:

$$\pi_{\text{ideal planner}} := \operatorname*{argmax}_{(x_0, y_0), (x_1, y_1)} P(\text{correct} \mid x_1, y_1).$$

Thereby, $(x_0, y_0)$ is the next location and $(x_1, y_1)$ is the location thereafter. By jointly optimizing the entire sequence of fixation locations, the ideal planner always performs at least as well as the ideal observer. Intuitively, $\pi_{\text{ideal observer}}$ and $\pi_{\text{ideal planner}}$ yield the same action sequence if the sequence only contains a single action, i.e. a single fixation. Also, the first fixation location of the ideal observer is the same for both conditions. Crucially, this is not the case for $\pi_{\text{ideal planner}}$. By jointly maximizing the reward over the whole action sequence, even the first fixation location can differ between the conditions.

Next, we derive the probability of a correct decision given a sequence of fixation locations since both proposed policies depend on the performance in the task, i.e., the detection probability [...] where $P_T(x, y)$ is the probability that the target is located at (x, y) and $P_O(x, y \mid x_n, y_n)$ is the probability that the location (x, y) is covered by the search given that the saccade was targeted [...] where the threshold is equal to the radius of the search area (6.5°).
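The difference between the greedy and the jointly optimized policy, together with a coverage-based detection probability, can be sketched on a toy problem. The one-dimensional "shape", its target probabilities, the radius, and all function names are hypothetical; coverage is treated as a hard distance threshold and performance as the probability mass searched so far, which simplifies the model described above.

```python
from itertools import product

# Hypothetical 1-D "shape": target probability P_T at six candidate locations.
P_T = {0: 0.14, 1: 0.16, 2: 0.22, 3: 0.18, 4: 0.15, 5: 0.15}
RADIUS = 1.5  # toy search radius (the experiment used 6.5 deg in 2-D)


def covered(x, fixations):
    """Toy P_O: location x counts as searched if within RADIUS of any fixation."""
    return any(abs(x - f) < RADIUS for f in fixations)


def p_found(fixations):
    """Probability mass of target locations searched by the given fixations."""
    return sum(p for x, p in P_T.items() if covered(x, fixations))


def ideal_observer(n_fixations):
    """Greedy policy: every fixation maximizes performance after that fixation."""
    path = []
    for _ in range(n_fixations):
        path.append(max(P_T, key=lambda f: p_found(path + [f])))
    return path


def ideal_planner(n_fixations):
    """Joint policy: evaluate all sequences and maximize final performance."""
    return list(max(product(P_T, repeat=n_fixations), key=p_found))
```

With these made-up numbers both policies choose location 2 when only one fixation is available. With two fixations the greedy sequence still starts at 2, whereas the planner starts at 1 so that the second fixation at 4 covers the remaining locations. The first fixation of the planner thus depends on the number of available fixations, while that of the greedy observer does not.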

Model extensions. To take into account known cognitive and biological constraints we need to incorporate several well known characteristics of the human visual system. We introduced costs on the saccade amplitude thus favoring smaller eye movements. As was shown by