The psychophysics of human three-dimensional active visuospatial problem-solving

Our understanding of how visual systems detect, analyze and interpret visual stimuli has advanced greatly. However, the visual systems of all animals do much more; they enable visual behaviours. How well the visual system performs while interacting with the visual environment and how vision is used in the real world is far from fully understood, especially in humans. It has been suggested that comparison is the most primitive of psychophysical tasks. Thus, as a probe into these active visual behaviours, we use a same-different task: Are two physical 3D objects visually the same? This task is a fundamental cognitive ability. We pose this question to human subjects who are free to move about and examine two real objects in a physical 3D space. The experimental design is such that all behaviours are directed to viewpoint change. Without any training, our participants achieved a mean accuracy of 93.82%. No learning effect was observed on accuracy after many trials, but some effect was seen for response time, number of fixations and extent of head movement. Our probe task, even though easily executed at high-performance levels, uncovered a surprising variety of complex strategies for viewpoint control, suggesting that solutions were developed dynamically and deployed in a seemingly directed hypothesize-and-test manner tailored to the specific task. Subjects need not acquire task-specific knowledge; instead, they formulate effective solutions right from the outset, and as they engage in a series of attempts, those solutions progressively refine, becoming more efficient without compromising accuracy.


I. INTRODUCTION
Human visual ability feels so effortless that it is literally taken for granted. However, any intuitive introspection into this ability rarely reveals its profound nature. On the contrary, the tendency has been to prefer simpler descriptions in an Occam's Razor (or parsimony) sense, and although these have helped move our understanding along, they are no longer as useful. A main focus has been on how well the eyes and visual system can detect a target stimulus, i.e., Visual Function [1]. This has been studied extensively over many decades, in several literatures (in computer vision, in ophthalmology, in visual psychophysics). The visual systems of all animals do much more than simply detect stimuli; they also enable locomotion, seek food, detect threats, guide mating and rearing of offspring. In humans, there is even more than supporting basic survival. We use our visual abilities to create, exploit, admire, destroy, and manipulate our 3D physical world. How well we perform while interacting with the visual environment and how vision is used in everyday activities has been termed Functional Vision [1]. In contrast with visual function, this has not been well examined, in part due to its inherent experimental difficulty and to the fact that it seems far more complex an activity to define. Research of this kind has recently become possible in a variety of animals [2], [3], [4]. We report a first detailed and precise investigation into human functional vision.
Although it seems easy to enumerate the many kinds of behaviours humans perform in their visual world, it is far less easy to know how to best probe the nature of such behaviours in a general sense. A classic example is seen in the well-known experiments of Shepard & Metzler in 1971 [5] where they showed subjects, seated in front of a video monitor, 2D projections of unknown 3D geometric objects (Figure 1A). Subjects were instructed to determine if the two objects were the same or different. Note how self-occlusion is a natural characteristic of their objects. Their results showed that subjects seemed to mentally rotate objects in order to test potential geometric correspondence. This work was seminal in this area but seemed difficult to generalize to viewing real 3D objects in an unconstrained manner. Nevertheless, their experiment provided the inspiration for many important investigations as well as for what we present here. In a more foundational manner, the act of comparison has been suggested as the most primitive psychophysical task [6]; efforts to discover the deep nature of real-world visual comparison behaviour may have an impact on the full spectrum of human visual behaviour.
How do humans decide if two objects are the same or different (as Shepard & Metzler asked) within a physical 3D environment where they are free to explore those objects? Where do they look? How do they look? How do they move? Such questions are central to this study. Figure 1B shows an example of one common such setting within a sculpture gallery: are these two sculptures the same? This may be easy, or it may be difficult to determine. What is clear is that a single image of the objects generally will not suffice. Suppose we consider 3D versions of Shepard & Metzler's unknown geometric objects. We could construct a range of such objects of varying physical complexity and show them to subjects. Subjects would be free to examine the objects without touching them and free to move around them, just like in the sculpture gallery. This requires a means to precisely record exactly what subjects are looking at and from what viewing position while deciding the answer to the simple question: are two objects the same or different? Figure 1C gives a sketch of our experiment showing a human subject wearing the recording device.
With this question in mind, we have designed a novel set of stimuli with known geometric complexity. We also employ a first-of-its-kind experimental setup that allows for precise, synchronized tracking of head motion with 6 degrees of freedom and eye gaze while subjects are completely untethered to allow natural task execution. We have collected detailed data on humans performing a visuospatial task in hundreds of experiments and present an in-depth analysis with respect to various cumulative performance metrics such as the number of fixations, response time, amount of movement, and learning effect. We are not restricted to summary statistics of single actions such as these, but we also examine sequences of actions that reveal a fascinating array of problem-solving strategies employed differently by each subject and for each test case.
We also challenge the current intuitive views that perhaps such a visual ability can be captured by a large dataset of trials that feed a sophisticated learning method, or that clever use of visual saliency maps might be the key, or that novel policies for a reinforcement learning approach might suffice. Our expectation is that this, and related non-trivial tasks, need something different. Our experimental goal is to reveal what human solutions to such 3D visuospatial tasks actually involve.
To analyze the effect of stimulus complexity, our set of objects consists of 12 objects divided into three complexity levels, each of which can be presented at an arbitrary 3D pose. How each object appears (the image it projects to the subject) changes, sometimes radically, with change in viewing position. As a result we include not only the object poses as experimental variables but also the subject's initial viewing position. To keep the experiment tractable we investigate three object pose orientations and three viewer starting positions. Every subject performed 18 trials with randomly selected experimental configurations to test if a learning effect exists.
Fig. 1: Our experiment as it evolved, beginning with Shepard & Metzler's inspiration, the reality of visuospatial problem-solving "in the wild", and our actual setup. A) The study of human vision often involves a two-dimensional probe, i.e., visual function. As illustrated here, the subject sits at a desk with head stabilized in a forehead and chin mount while performing an experiment involving images on a screen. B) Real-world visuospatial problem-solving, i.e., functional vision, however, is distinctively different. Here we show someone in a sculpture gallery wondering if these two marble heads are the same. Humans do not passively receive stimuli; rather, they choose what to look at and how. We move our head and body in a three-dimensional world and view objects from directions and positions that are most suited to our viewing purpose. It should be clear that answering our sculpture query here might require more than one glance. It is we who decide what to look at, not an experimenter. C) Our experimental setup is as shown. A subject wears a special, wireless headset and is shown two objects mounted on posts at certain 3D orientations. The subject is asked to determine, without touching, if the two objects are the same (in all aspects) or different and is completely free and untethered to move anywhere they choose. Gaze and view are precisely recorded (see Methods & Materials).
Surprisingly, humans have virtually no difficulty with this task, even for hard cases. The accuracy ranged from 80-100% across all configurations. Much data acquisition occurs with a minimum of 6 and a maximum of 800 eye fixations. Interestingly, no statistical change was observed in accuracy throughout the trials. However, a learning effect was seen for the number of fixations on the objects, response time, and head movement. The sequence of actions we observed strongly suggests that human problem-solving strategies are dynamically determined and deployed in a seemingly directed hypothesize-and-test manner tailored to the particular task instance at hand. Subjects do not need to learn the task; they develop good solutions from the start, and over the set of trials, those solutions become smoother or more efficient while maintaining accuracy.

II. METHODS & MATERIALS

A. Stimuli and Task
The task, including the stimuli, is illustrated in Figure 1C and is designed to be a two-alternative forced choice. Subjects were allowed to move within a constrained area of about 3.4m by 4.3m, presented with two static three-dimensional stimuli mounted on acrylic posts. The task was to determine whether the two stimuli were the same or different. Sameness in our experiment is defined as geometric congruence; all stimuli share the same colour and surface texture.
The stimuli are part of a set of three-dimensional physical objects called TEOS [7]. The objects are inspired by the stimuli of Shepard & Metzler. TEOS objects are all three-dimensional and have a known geometric complexity. Furthermore, a shared common-coordinate system allows quantifying the orientational difference of two objects.
An illustration of the objects is shown in Figure 2A. The set contains twelve objects split equally into three different complexity levels, which are defined by the number of blocks used to build an object. The level of object complexity C will be indicated by the subscript, such as Ce, Cm, and Ch for easy, medium, and hard, respectively. TEOS objects are 3D printable. The objects are roughly 12cm × 14cm × 18cm in size. Movements of the subject were not restricted, and no time constraints were given. However, a definite answer (same or different, i.e., 2AFC) must be given to end the trial. Each subject performed 18 trials evenly split among complexity levels. We also investigated different starting positions as they determine the initial observation of the objects. Figure 3A illustrates a top view of the experimental space marking the three positions from which a trial can start: equidistant from both objects (Pl), in line with both objects (Ps) and oblique to both (Pc). Furthermore, we looked at the effect of object orientation difference. We limited the large space of possibilities to three values of orientation difference, 0°, 90°, and 180°. For all trials, we have used the same three poses as illustrated in Figure 3B. Furthermore, all experimental variables were selected randomly for each trial, including sameness, complexity, starting position, and object orientation.
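The trial design described above can be sketched in code. This is a minimal illustration only: the exact randomization procedure is not specified here, so the function name `sample_session`, the dictionary keys, and the choice to shuffle a balanced list of complexity levels (to satisfy the even split) are assumptions.

```python
import random
from collections import Counter

# Experimental variables as named in the text.
COMPLEXITIES = ["Ce", "Cm", "Ch"]      # easy, medium, hard
START_POSITIONS = ["Pl", "Ps", "Pc"]   # three starting positions
ORIENTATION_DIFFS = [0, 90, 180]       # degrees
TRIALS_PER_SUBJECT = 18


def sample_session(rng):
    """Draw one subject's 18 trials (hypothetical procedure).

    Complexity is balanced (6 trials per level) and shuffled; the other
    variables are drawn uniformly at random for each trial.
    """
    per_level = TRIALS_PER_SUBJECT // len(COMPLEXITIES)
    levels = COMPLEXITIES * per_level
    rng.shuffle(levels)
    return [
        {
            "complexity": level,
            "start": rng.choice(START_POSITIONS),
            "orientation_deg": rng.choice(ORIENTATION_DIFFS),
            "same": rng.choice([True, False]),
        }
        for level in levels
    ]


session = sample_session(random.Random(0))
complexity_counts = Counter(trial["complexity"] for trial in session)
print(len(session), dict(complexity_counts))
```

Seeding the generator makes a session reproducible, which is convenient when a recorded trial later has to be matched to its configuration.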
Lastly, after the experiment, subjects answered questions about their approach to solving this task (e.g., "what was your strategy for approaching the task", "did you notice any changes in your approach throughout the trials", "which instances were more challenging than others and why").

B. Data Acquisition and Analysis
We created a novel active vision experimental facility (named PESAO, the Psychophysical Experimental Setup for Active Observers [8]) which formed the basis for all data acquisition and analysis. Its primary components are eye-tracking glasses to capture the gaze direction, a motion tracking system for head tracking, 1st and 3rd person video, and a homogeneous lighting setup. Specialized software synchronized and aligned data streams from each source at microsecond precision [8]. To track the position and orientation of the stimuli, we developed motion-tracking markers that were attached to the stand of the objects. The subject's head motion was tracked using a custom tracking body attached to the eye-tracking glasses. Figure 2B shows the custom clip-on equipment and Figure 2C shows a photo of the assembly with the glasses. The tracking frequency for objects and head motion was 120 Hz, and for the eyes 50 Hz. The accuracy for the motion tracking system was ≈ 0.2mm (RMSE) in 97% of the capture volume. The gaze was tracked with 1.42° mean accuracy. Lastly, our statistical significance analysis is performed using one-way repeated-measures ANOVA.
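The one-way repeated-measures ANOVA named above can be sketched as follows. This is the textbook sum-of-squares decomposition, not the analysis code used in the study; the function name `rm_anova_oneway` and the toy data are illustrative assumptions. With n = 47 subjects and k levels of a factor, the degrees of freedom are (k − 1, (k − 1)(n − 1)), i.e., (1, 46), (2, 92), and (5, 230) for k = 2, 3, and 6.

```python
def rm_anova_oneway(data):
    """One-way repeated-measures ANOVA (textbook decomposition).

    data: list of rows, one row per subject, one column per condition level.
    Returns (F, df_condition, df_error).
    """
    n = len(data)        # number of subjects
    k = len(data[0])     # number of condition levels
    grand = sum(sum(row) for row in data) / (n * k)

    cond_means = [sum(row[j] for row in data) / n for j in range(k)]
    subj_means = [sum(row) / k for row in data]

    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    # Error term: what remains after removing condition and subject effects.
    ss_err = ss_total - ss_cond - ss_subj

    df_cond = k - 1
    df_err = (k - 1) * (n - 1)
    f_value = (ss_cond / df_cond) / (ss_err / df_err)
    return f_value, df_cond, df_err


# Toy data (hypothetical): 3 subjects x 3 condition levels.
toy = [[1, 2, 3], [2, 3, 5], [3, 4, 4]]
f_value, df1, df2 = rm_anova_oneway(toy)
print(f_value, df1, df2)  # F is about 9.0 with df (2, 4)
```

A p-value would additionally require the F distribution's survival function, which is left out here to keep the sketch dependency-free.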

C. Subjects
Forty-seven participants randomly sampled from the general public took part in our experiment. The average age was 23.4 years, ranging from 19 to 52 years of age. All subjects had normal or corrected-to-normal vision, gave informed consent and were paid for participation. The experiment was approved by the office of research ethics at York University (Certificates #2020-137 and #2020-217).

III. RESULTS
In total, we conducted 846 trials. Each subject performed 18 trials, sampling each configuration of experimental variables more than 15 times. We recorded about 80,000 fixations with over 4.5M head poses and 11 hours of footage of 1st and 3rd person video each. A trial, visualized using the graphical functions of PESAO, is shown in Figure 4. The subject required 61 fixations, moved a total of 18.75m to complete the task, and answered correctly (same).
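The trial counts above follow from simple arithmetic, sketched below. The sketch assumes that a "configuration" is one combination of complexity level, starting position, orientation difference, and sameness; that reading is an assumption, although it is consistent with the reported figure of more than 15 samples per configuration.

```python
n_subjects = 47
trials_per_subject = 18
total_trials = n_subjects * trials_per_subject

# Assumed definition of a configuration:
# 3 complexity levels x 3 starting positions x 3 orientation differences
# x 2 sameness values (same / different).
n_configurations = 3 * 3 * 3 * 2

mean_samples_per_configuration = total_trials / n_configurations

print(total_trials)                                # 846
print(n_configurations)                            # 54
print(round(mean_samples_per_configuration, 2))    # 15.67
```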
We next present observations on accuracy, number of fixations, response time, movement, and fixation patterns.

A. Accuracy
Humans are remarkably good at this task. Throughout all configurations, participants achieved an absolute mean accuracy of 93.83%, σ = 3.9% (Figure S1, Supporting Information (SI)). The best-performing configuration was with stimuli of Ce, starting position Pl and a difference in object orientations of 0.0°. Not a single trial of this configuration was answered incorrectly, regardless of the object sameness. Object complexity plays a relevant role in how well participants performed this task. Objects of Ce complexity yield an average accuracy of 96.1%, Cm objects 94.18%, and Ch objects 91.2%.
Fig. 4: A visualization of the recorded data from PESAO. The movement of the subject is plotted as a dashed line in white, and fixations on either object are illustrated as a frustum in the corresponding colour of the fixated object. Selected fixation frusta are annotated with snapshots of the subject's first-person view and the gaze at a particular fixation (red circle). In this example, the objects are the same, of complexity level Ch, they differ in pose by 180°, and the subject started from position Ps.
The sameness of objects has a significant effect on accuracy (F(1,46) = 3.58, p = 0.044, see Figure S1 c), SI). If the objects are the same, in general, a higher accuracy is seen (94.3%) than for different pairings (91.6%).
We were also interested in investigating the effect of the starting position (Figure S1 a), SI). For Ce, the best mean performance was observed from Pc; for Cm, the best performance was seen starting from Ps; and for Ch, starting from Pl achieved the highest accuracy. While the worst performances varied between the Ps and Pc starting positions, Pl did seem to result, generally, in higher accuracy. However, no significant effect of the starting position with respect to accuracy was observed (F(2,92) = 2.11, p = 0.125).
An investigation of the object orientation with regard to accuracy yields the following observations (Figure S1 b), SI). For the Ce case, there is a clear gradient of accuracy following the increase of orientation difference. Notably, trials of Ce objects with orientation 0° had an accuracy of 100%. However, for the other complexity levels, a different pattern can be identified; 90° was most accurately identified with 94.82% and 90% for Cm and Ch, respectively. Orientations 0° and 180° ranked second and third. Interestingly, the object orientation does not have a significant effect on the accuracy (F(2,92) = 2.06, p = 0.132).
Every subject performed 18 trials, and no target object configurations were repeated. Nevertheless, we expected that some improvement in accuracy would begin to appear. Surprisingly, this was not the case; no significant learning effect was observed (F(2,92) = 0.88, p = 0.414), see Figure S1 d), SI.

B. Number of Fixations
A substantial amount of data acquisition occurs during a solution to this task, as subjects used a minimum of 6 different eye fixations while averaging 92.38 across all trials. The object complexity plays a role in how many fixations are required to solve this task. Ce objects required about 66 fixations, Cm 69 fixations, and Ch 102 fixations on average. The effect is statistically significant (F(2,92) = 32.15, p < 0.0001), see Figure S2 e), SI.
The evaluation of sameness against the number of fixations revealed two major insights (Figure S2 c), SI). Firstly, the same pairings always required significantly more fixations than different pairings (F(1,46) = 7.78, p = 0.00761). Secondly, the same pairings needed at least 10, and in some cases up to 20, more fixations on average. Furthermore, error responses required significantly more fixations than correct answers (F(1,46) = 9.762, p < 0.003).
For Cm and Ch cases, starting from Ps resulted in the most fixations on average, followed by starting from Pc and Pl (Figure S2 a), SI). As with accuracy, the starting position has no significant effect on the number of fixations (F(2,92) = 1.37, p = 0.258).
In terms of object orientation (Figure S2 b), SI), orientations 0° and 90° are similar, varying only a few fixations for the median and upper and lower quartile. In terms of absolute values, a few trials of Ch and orientation of 0° required about 800 fixations. Notably, these trials started from Pl. In summary, larger orientation differences required significantly more fixations regardless of object complexity (F(2,92) = 8.31, p = 0.00048).
Notably, a significant learning effect with respect to the number of fixations is observed (F(5,230) = 3.239, p = 0.0075). This means that participants require fewer fixations as the trials progress (Figure S2 d), SI), hence solving the task more efficiently but not more accurately.

C. Response Time
The response time is the time elapsed from the first fixation of the trial to the time when the subject provided the answer. On average, the response time was 47.52s (σ = 30.39). Among all trials, the shortest response was for a Ce level, starting from Ps, with 180° orientational difference, taking only 4.2s. The longest response time was recorded for a Ch level, starting from Pl, and required 298s.
The complexity of the stimuli affects the response time gradually for Ce (on average 40.03s) and Cm (on average 42.01s) cases, and distinctively for Ch (on average 60.53s); increasing object complexity also means a significant increase in response time (F(2,92) = 28.87, p < 0.0001), see Figure S3 e), SI. Furthermore, the response time is approximately a linearly increasing function of the angular difference of the objects: Ce (n = 7, ∆t = 5.72s), Cm (n = 10, ∆t = 4.2s), and Ch (n = 18, ∆t = 3.36s), where n is the number of elements used to create the object and ∆t is the normalized response time with respect to a single element (Figure S3 b), SI). This relates well to Shepard & Metzler's conclusions, but in our 3D active setting.
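The normalized response times ∆t quoted above can be reproduced by dividing the mean response time of each complexity level by its number of elements n. The short sketch below uses only values stated in the text; the variable names are illustrative.

```python
# Mean response time per complexity level, in seconds (from the text).
mean_response_time_s = {"Ce": 40.03, "Cm": 42.01, "Ch": 60.53}
# Number of elements (blocks) per object at each level (from the text).
n_elements = {"Ce": 7, "Cm": 10, "Ch": 18}

# Normalized response time with respect to a single element.
delta_t = {
    level: round(mean_response_time_s[level] / n_elements[level], 2)
    for level in mean_response_time_s
}
print(delta_t)  # {'Ce': 5.72, 'Cm': 4.2, 'Ch': 3.36}
```

The per-element time decreases with complexity, so the total response time grows more slowly than the number of elements.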
Similarly to the number of fixations required, the sameness of the stimuli has a distinct effect on the response time (F(1,46) = 14.279, p = 0.0004), see Figure S3 c), SI; same cases take significantly longer than different ones.
Consistent with other measures, the response time is not significantly affected by the starting position (F(2,92) = 0.12, p = 0.886), see Figure S3 a), SI. The object orientation, however, does affect response time (F(2,92) = 12.95, p < 0.0001). In general, a lesser orientation difference also means a quicker response time: 0° was answered the quickest, followed by 90° and 180°, see Figure S3 b), SI.
Subjects seem to develop more efficient strategies as they complete more trials. Starting at about 47s (Mdn.) at the first trial, the response time drops to about 34s (Mdn.) for trials two to four and drops further to 29s (Mdn.) at trials five and six. For Ch cases, a drop from the first trial (70s Mdn.) to the second trial (about 50s Mdn.) can be seen (Figure S3 d), SI). Overall, trial progression has a significant effect on response time (F(5,230) = 6.01, p = 0.0003).

D. Movement
The mean head movement was 16.62m. The amount of head movement slightly increased from complexity cases of Ce to Cm but increased more distinctly for Ch cases; the object complexity significantly affects the amount of head movement (F(2,92) = 35.35, p < 0.0001, Figure S4 d), SI).
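One plausible way to obtain an amount of head movement in metres from tracked head positions is to sum the distances between consecutive samples. The sketch below illustrates this; it is an assumption about the computation, not a description of PESAO, and the function name `path_length` and the sample track are hypothetical.

```python
import math


def path_length(positions):
    """Total path length of a sequence of 3D positions (same unit as input)."""
    return sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))


# Hypothetical head positions in metres (x, y, z), e.g. sampled at 120 Hz.
track = [
    (0.0, 0.0, 1.7),
    (3.0, 4.0, 1.7),
    (3.0, 4.0, 1.7),  # subject stands still between two samples
    (0.0, 0.0, 1.7),
]
total_movement_m = path_length(track)
print(total_movement_m)  # 10.0
```

With real tracking data, sample-to-sample jitter also accumulates in such a sum, so some smoothing or a minimum-displacement threshold may be needed before summing.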
Aligned with the number of fixations, response time and accuracy, the amount of movement is greater for the same object pairings across all complexity levels (F(1,46) = 31.37, p < 0.0001), see Figure S4 c), SI. For different cases, the wider spread between the upper and lower quartiles indicates more uncertainty across subjects in how to approach these cases.
We found no relationship between the amount of head movement and the starting position (Figure S4 a), SI). However, a significant relationship exists between the correctness of the answer and head movement (Figure S4 e), SI). Error responses were accompanied by significantly more head movement (F(1,46) = 41.56, p < 0.0001).
A clear trend (F(2,92) = 22.74, p < 0.0001) can be observed between the amount of head movement and the amount of orientational difference (see Figure S4 b), SI); at 0° the least amount of movement was required, at 90° an increase of 2-5m on average is recorded, and at 180° an additional increase of 1-5m across all complexity classes is recorded.
A significant reduction in head movement is noticeable over the course of the trials (F(5,230) = 5.403, p = 0.0001). This means that participants show a learning effect in the sense that they execute a strategy with less head movement as the experiment progresses. For Cm and Ch cases, a trend is not visible directly, but for the Ce cases, it is (Figure S4 d), SI). Ch cases start off at the first trial with just above 20m and drop to the absolute mean value of 16.62m and stay steady, repeatedly falling marginally below and exceeding it; the same holds for the Cm case, where no learning trend can be observed. However, the Ce case, after a slight up-trend for the second trial, decreases steadily from about 16m down to about 10m, which is the equivalent of an improvement of 37.5%.
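The 37.5% figure is the relative reduction from about 16m to about 10m. A one-function sketch of this arithmetic (the function name is illustrative):

```python
def relative_reduction(start, end):
    """Fractional reduction from start to end, relative to start."""
    return (start - end) / start


# Approximate head movement for Ce cases, early versus late trials (metres).
reduction = relative_reduction(16.0, 10.0)
print(f"{reduction:.1%}")  # 37.5%
```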
Lastly, in Figure S5, SI, we plot normalized measured variables (accuracy, number of fixations, response time, and movement) against trial number. It is easily seen that every variable improves over the trials except for accuracy.

E. Fixation Patterns
Lastly, we looked at fixation patterns. We considered the ratio of fixations landing on each object respectively and fixation groupings, specifically fixations falling on the same object before shifting away.
To evaluate the fixation ratio, we looked at the total number of fixations for either object. The object with the most fixations is considered the primary object, and the object with fewer or the same number of fixations is the secondary object. On average, the primary object accounted for 59.53% of fixations, and the secondary object for 40.47%.
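The primary/secondary split defined above can be computed per trial from the sequence of fixated objects. The sketch below is illustrative only: the object labels "A" and "B", the function name, and the example sequence are hypothetical.

```python
from collections import Counter


def fixation_ratio(fixated_objects):
    """Return (primary_share, secondary_share) for one trial.

    fixated_objects: sequence of labels, one per fixation, naming the
    object ("A" or "B") on which the fixation landed.
    The primary object is the one with more fixations; on a tie both
    shares are 0.5.
    """
    counts = Counter(fixated_objects)
    a, b = counts.get("A", 0), counts.get("B", 0)
    total = a + b
    return max(a, b) / total, min(a, b) / total


# Hypothetical trial with 10 fixations: 6 on object A, 4 on object B.
trial = ["A", "A", "B", "A", "B", "A", "A", "B", "B", "A"]
primary_share, secondary_share = fixation_ratio(trial)
print(primary_share, secondary_share)  # 0.6 0.4
```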
However, for some configurations, a difference in fixation ratio is noticeable. For instance, the difference in the average number of fixations with respect to object orientation decreases with increasing orientational difference (Figure S6 c, SI).
The sameness of the object had largely no significant effect on fixation groups (Figure S7 d, SI). Only single (F(2,92) = 5.25, p = 0.02) and septuple (F(2,46) = 13.7, p = 0.0005) groups are used significantly more for the same objects than for different ones. As trials advanced, subjects used single groupings progressively less (F(2,92) = 1.33, p = 0.25). A similar trend is observed for couple, triple, quadruple and quintuple groups; none are significant, however. Larger groupings see an increase in probability as trials proceed. Notably, only a few sparse data points are recorded for octuple pairings up to trial 8. Octuple pairings are more regularly seen for trials 9-19 (Figure S7 e, SI).
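Fixation groupings, i.e., runs of consecutive fixations on the same object before the gaze shifts away, amount to a run-length encoding of the fixation sequence. The sketch below is a minimal illustration; the labels, function names, and the choice to express occurrence as the fraction of groups of each size (rather than the fraction of fixations) are assumptions.

```python
from collections import Counter
from itertools import groupby


def fixation_groups(fixated_objects):
    """Run-length encode a fixation sequence into (object, group_size) pairs."""
    return [(obj, len(list(run))) for obj, run in groupby(fixated_objects)]


def group_size_probabilities(fixated_objects):
    """Fraction of groups of each size (1 = single, 2 = couple, ...)."""
    sizes = Counter(size for _, size in fixation_groups(fixated_objects))
    n_groups = sum(sizes.values())
    return {size: count / n_groups for size, count in sorted(sizes.items())}


# Hypothetical trial: a couple on A, a single on B, a triple on A, ...
trial = ["A", "A", "B", "A", "A", "A", "B", "B", "A"]
groups = fixation_groups(trial)
probabilities = group_size_probabilities(trial)
print(groups)         # [('A', 2), ('B', 1), ('A', 3), ('B', 2), ('A', 1)]
print(probabilities)  # {1: 0.4, 2: 0.4, 3: 0.2}
```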
The fixation groupings reveal a purposeful gaze toward the solution of 3D visuospatial problems. We hope that further analysis and experiments will flesh out the behaviours comprising human functional vision.

IV. DISCUSSION
The goal of this study was to examine functional vision in human subjects, specifically, how they solve a visuospatial problem in a three-dimensional space.
We addressed this by developing a three-dimensional version of the well-known same-different task as a probe and an experimental setup allowing natural, visual problem-solving and precision recording. Such physical object comparisons seem a fundamental cognitive ability [9].
Our main results follow. People are very good at this task, even in difficult cases. No training trials were required. Response times ranged from 4 to 298s from the simplest to the most complex cases, and accuracy from 80% to 100%. A great deal of data acquisition occurs during all trials, with the number of eye movements (separate fixations and separate images processed) ranging from 6 to 800 fixations.
Furthermore, we showed that not only are multiple fixations required, but also multiple fixations in sequence on the same stimulus. Only about 20% of all fixations are single fixations, and fixation groups get larger and more frequent with increasing levels of complexity and orientation. These groups seem to develop throughout the course of the trial. Simpler groupings (single, couple, triple) are replaced by more complex ones (septuple, octuple, larger) as the trial progresses. This hints that subjects use what they know to dynamically compose visuospatial strategies. In our analysis of fixation ratios, subjects did not simply observe each object with the same number of fixations. They chose one object as their primary object (59.53% of total fixations, regardless of experimental setup) and spent just about 40% of fixations on the secondary to solve this problem. Brute-force approaches would have averaged a 50:50 ratio, leading to the conclusion that subjects did not use random or uninformed search strategies. Could they be building internal models of the objects, which are then compared? This possibility needs further investigation. In Figure S8, SI, we illustrate a set of fixations observed in a trial with Cm complexity, same objects, starting from Pc, presented at 90°. This pattern looks like an observation strategy; the gaze goes back and forth between stimuli. Interestingly, the fixations land on similar-looking areas. It appears that the subject compares object features.
No statistical change was observed in accuracy with increasing trials for individual subjects. However, a change was observed in the number of fixations, response time, and amount of head movement. This is surprising, as the accuracy did not significantly change throughout the trials, and shows that the set of visuospatial problem-solving techniques, innate and learned through a lifetime, generalized well to this specific task. However, the decrease in the number of fixations, response time, and amount of head movement shows that participants may have fine-tuned them for efficiency. This learning effect seems counterintuitive, especially when compared to modern computational attempts at active learning (however, see [10] for review and a promising change).
Our work also shows consistency with the classic version of the same-different task [5] in that the time required to determine if two perspective drawings portray objects of the same three-dimensional shape is found to be a linearly increasing function of the angular difference in the portrayed orientations for the two objects.
Other three-dimensional stimuli were considered. Some examples are the well-studied greebles families [11], the CLEVR data set [12], or T-LESS [13]. All are virtual but could be physically created, for instance, using fused deposition modelling. Various versions of the greebles objects have been introduced. These objects would function well as a stimulus for this experiment as they are textureless, and a common-coordinate system could be defined easily (greebles are all structured similarly). However, different greebles appear quite different and would make the same-different task trivial. The CLEVR data set does use simple blocks to build the stimulus, similar to TEOS; however, neither a systematic measure for self-occlusion nor a common coordinate system to define the object pose exists. T-LESS objects do have an associated pose and also have a textureless appearance, but the objects are easily differentiable. TEOS combines the properties crucial to discovering any patterns in solving the same-different task: novel and unfamiliar objects, textureless appearance, known complexity, a common coordinate system, and varying self-occlusion.
Humans use vision for a vast array of behaviours in the real world; visuospatial intelligence is much more than simply detecting a stimulus or recognizing an object or scene. Unfortunately, past methodologies have limited studies beyond these, and the fundamental questions of three-dimensional visuospatial intelligence in the real world remain. Where do we look? How do we look? How do we move? How do we seek out the data that enables problems to be solved? The first steps towards these answers are presented along with an experimental infrastructure appropriate for many further studies.
Our data show that we actually do a great deal of 'looking' in order to solve a problem. Adult humans do not need to learn where to look and exhibit an array of complex fixation patterns. We also move about as we look, most likely because we choose what we wish to see and from what vantage in order to support the task at hand. Although the use of our eyes may seem effortless, the complexity of actions is staggering, and unravelling their purpose (the 'why' behind all this looking) poses an exciting challenge.
Visualizes the effect of the progression through trials. A significant reduction in head movement is noticeable over the course of the trials. This means that participants show a learning effect in the sense that they execute a strategy with less head movement as the experiment progresses. For the Cm and Ch cases, a trend is not directly visible, but for the Ce cases, it is. Ch cases start at just above 20 m in the first trial, drop to the absolute mean value of 16.62 m and then stay steady, repeatedly falling marginally below and rising marginally above it; the Cm case behaves similarly, and no learning trend can be observed. The Ce case, however, after a slight uptick at the second trial, decreases steadily from about 16 m to about 10 m, a reduction of 37.5%. Lastly, (e) shows the effect of correct/error responses. There is a significant effect of answer correctness on head movement: error responses were accompanied by significantly more head movement.
We looked at the total number of fixations for either object. The object with the most fixations is considered the primary object, and the object with fewer or the same number of fixations is the secondary object. Interestingly, subjects tend to choose a primary and a secondary object: on average, the primary object accounted for 59.53% of fixations, and the secondary object for 40.47%. However, none of the experimental variables has a significant effect on the fixation ratio.
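The primary/secondary split described for the fixation ratio can be illustrated with a minimal sketch. It assumes that each fixation in a trial has already been assigned to one of the two objects; the labels "A" and "B" and the function name `fixation_shares` are hypothetical and are not taken from the study's analysis code. On a tie, both objects receive a 50% share, which is one possible reading of "fewer or the same number".

```python
from collections import Counter


def fixation_shares(fixation_targets):
    """Return (primary_share, secondary_share) in percent for one trial.

    fixation_targets: one label per fixation naming the fixated object.
    The labels "A" and "B" are hypothetical placeholders.
    """
    counts = Counter(fixation_targets)
    a, b = counts.get("A", 0), counts.get("B", 0)
    total = a + b
    if total == 0:
        raise ValueError("no fixations on either object")
    primary = max(a, b)    # object with the most fixations
    secondary = min(a, b)  # fewer fixations, or the same number on a tie
    return 100.0 * primary / total, 100.0 * secondary / total


# Three fixations on one object and two on the other.
shares = fixation_shares(["A", "A", "B", "A", "B"])
print(shares)
```

Averaging the two shares over all trials would give figures comparable to the 59.53% and 40.47% reported above, under the stated assumptions.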

Fig. 2: A) Illustration of TEOS objects used as stimuli. The set is split into three different complexity levels. Complexity is defined as the number of blocks used to build an object. B) Expanded view drawing of the custom clip-on tracking equipment. It uses 8 rotationally variant positioned tracking markers to avoid ambiguities. C) Photograph of the assembled eye tracking glasses with tracking equipment.

Fig. 3: A) Top-down illustration of the experimental setup. It shows dimensions, as well as the three different starting positions investigated. B) We have investigated three orientational differences between the stimuli. 0° (top) means that there is no rotational difference between both object poses, i.e., the object rotations are aligned. 90° and 180° mean that the poses have a rotational difference of 90° and 180°, respectively. For all trials, we have used the same poses as shown in this illustration.
V. FUNDING
This material is based upon work supported by the Air Force Office of Scientific Research under award numbers FA9550-18-1-0054 and FA9550-22-1-0538 (Computational Cognition and Machine Intelligence, and Cognitive and Computational Neuroscience portfolios); the Canada Research Chairs Program (grant number 950-231659); Natural Sciences and Engineering Research Council of Canada (grant numbers RGPIN-2016-05352 and RGPIN-2022-04606).
Fig. S2: The number of fixations against different experimental variables. (a) The effect of the starting position. For Cm and Ch cases, starting from Ps resulted in the most fixations on average, followed by starting from Pc and Pl. (b) The effect of object orientation. Orientations 0° and 90° are similar, differing by only a few fixations in the median and in the upper and lower quartiles. In terms of absolute values, a few trials of Ch at an orientation of 0° required about 800 fixations. Notably, these trials started from Pl. Larger orientation differences required significantly more fixations regardless of object complexity. (c) The evaluation of sameness against the number of fixations. Same pairings always required significantly more fixations than different pairings: on average at least 10, and in some cases up to 20, more fixations. Additionally, error responses required significantly more fixations than correct answers. (d) A significant learning effect with respect to the number of fixations is observed. (e) The effect of correctness. Error responses result in more fixations than correct answers.

Fig. S5: A plot of measured normalized variables (accuracy, fixations, response time and movement) with respect to trial number. It is easily seen that every variable improves over the course of the trials except for accuracy.
Fig. S7: The result of the analysis of fixation groupings, i.e., the number of fixations on one object before changing focus to the other object. On average, 18.7% (σ = 10.01%) of fixations change focus between the objects every time. (a) The object complexity has a significant effect on single, couple, triple, quintuple, octuple and higher fixation groupings. Notably, for single, couple, and triple groupings, the probability of occurrence significantly decreases with increasing object complexity. For octuple and higher groupings, the opposite is true: their occurrence increases with increasing object complexity. (c) The object orientation has a significant effect on single and couple groupings; these are the dominant groupings at an object orientation of 0°, and their use decreases steadily with increasing object orientation. Similar to object complexity, larger fixation groupings are affected by object orientation as well. Specifically, septuple and higher groupings occur more frequently with increasing object orientation. (d) The sameness of the objects had largely no significant effect on fixation groupings. Only single and septuple groupings are used significantly more for same objects than for different ones. (e) As trials advanced, subjects used single groupings progressively less. A similar trend is observed for couple, triple, quadruple and quintuple groupings; none are significant, however. Larger groupings see an increase in probability as trials proceed. Notably, only a few sparse data points are recorded for octuple groupings up to trial 8. Octuple groupings are seen more regularly for trials 9-19. (f) The correctness of the answer is significantly correlated with single and septuple groupings. Single and septuple fixation groupings are used significantly more for correct answers than for error responses.
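The grouping measure defined in this caption, the number of consecutive fixations on one object before focus switches to the other, amounts to a run-length count. The following is a minimal sketch, assuming a trial is available as a sequence of per-fixation object labels; the labels "A" and "B" and the function name `grouping_probabilities` are hypothetical. It also assumes that the final run of a trial counts as a grouping even though no switch follows it; the original analysis may have handled that case differently.

```python
from collections import Counter
from itertools import groupby


def grouping_probabilities(fixation_targets):
    """Map each grouping size to its relative frequency within one trial.

    A grouping is a run of consecutive fixations on the same object
    (size 1 = single, 2 = couple, 3 = triple, ...).
    Assumption: the last run of the trial is counted as a grouping.
    """
    # groupby yields one group per run of identical consecutive labels.
    runs = [len(list(group)) for _, group in groupby(fixation_targets)]
    if not runs:
        return {}
    counts = Counter(runs)
    return {size: counts[size] / len(runs) for size in sorted(counts)}


# Runs: A (1), B B (2), A A A (3), B (1)
probs = grouping_probabilities(["A", "B", "B", "A", "A", "A", "B"])
print(probs)
```

Computing these per-trial probabilities and then comparing them across object complexity, orientation, sameness, trial number and correctness would correspond to the panels described in the caption.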

Fig. S8: Here, we show a group of fixations that go back and forth between both stimuli. Both objects are displayed in the orientation of their observation. The corresponding fixations are highlighted with red circles and a green border. Arrows point to and originate at the center of gaze. Further, the starting fixation is provided (annotated with "Start"), and the subsequent fixation is connected with an arrow. The alternating fixation ends at the fixation marked with "End". The two objects are Cm; they are the same object, presented at a 90° orientational difference. The mean accuracy for gaze fixations is 1.42°. Color encoded with uncertainty boundary in green.