An extensive dataset of eye movements during viewing of complex images

We present a dataset of free-viewing eye-movement recordings that contains more than 2.7 million fixation locations from 949 observers on more than 1000 images from different categories. This dataset aggregates and harmonizes data from 23 different studies conducted at the Institute of Cognitive Science at Osnabrück University and the University Medical Center in Hamburg-Eppendorf. Trained personnel recorded all studies under standard conditions with homogeneous equipment and parameter settings. All studies allowed for free eye-movements, and differed in the age range of participants (~7–80 years), stimulus sizes, stimulus modifications (phase scrambled, spatial filtering, mirrored), and stimulus categories (natural and urban scenes, web sites, fractal, pink-noise, and ambiguous artistic figures). The size and variability of viewing behavior within this dataset present a strong opportunity for evaluating and comparing computational models of overt attention and for thoroughly quantifying strategies of viewing behavior. This also makes the dataset a good starting point for investigating whether viewing strategies change in patient groups.


Background & Summary
By moving our eyes in fast and ballistic movements, our oculomotor system constantly selects which parts of the environment are processed with high-acuity vision. The study of this selection process spans several levels of neuroscientific analysis because it requires relating behavioral models of viewing behavior to the activity of individual neurons and brain networks. One of the key challenges for understanding the neural basis of selecting saccade targets is therefore to establish behavioral models of viewing behavior. Such models depend on an appropriate task for sampling viewing behavior from observers. One natural possibility is free-viewing of pictures and other stimuli. We define free-viewing as a task that imposes no external constraints on what locations or parts of a stimulus should be looked at. Instead, which locations are interesting or rewarding is defined internally by the observer. The lack of external constraints has two important advantages. On the one hand, it naturally leads to a rich variety of viewing behavior across observers and stimulus categories that is nevertheless highly structured 1. On the other hand, it implies that the task requires almost no training and only undemanding instructions, such that it can easily be executed by children 2, cognitively impaired individuals, and a variety of non-human species 3,4. These properties make free-viewing ideally suited for the study of complex oculomotor control behavior.
Yet, because observers might select different viewing strategies, the analysis of free-viewing data requires data across many observers and stimuli. Presently, a number of datasets are publicly available. Specifically, these include datasets that document viewing behavior of a rather small number of subjects on a large number of images 5,6. However, studies combining a sizable set of stimuli and a larger number of subjects are scarce 7. A more complete list of different contributions can be found at http://saliency.mit.edu/datasets.html. To address this issue, we present a dataset of eye-movement recordings from 949 observers who freely viewed images from different categories. We believe that this dataset will be a valuable resource for investigating behavioral and neural models of oculomotor control. First, computational modeling of viewing behavior is a challenging research field that depends on a gold standard for model evaluation and comparison. With 2.7 million fixations, the presented dataset will significantly increase the size of the corpus of available eye tracking data. Second, the size of this dataset allows fine-grained analysis of spatial and temporal characteristics of eye-movement behavior. This is an important aspect, since eye-movement trajectories are highly structured in space and time 8–11, and increasing the temporal window of analysis requires increasing amounts of data. Third, this dataset might act as a reference to identify changes in oculomotor control in specific subpopulations, e.g., after stroke or due to mental illness.
In summary, this unique dataset of viewing behavior will allow evaluations of models of viewing behavior against a large sample of observers and stimulus categories (Data Citation 1). In the following sections, we describe the origin of the contained data, detail pre-processing steps performed, and show how to use the overall dataset. We also give a short overview of basic properties of the dataset to allow other researchers to assess its usefulness for their own research questions.

Methods
Our dataset contains about 2.7 million fixation locations from 949 observers, who viewed a total of 1,474 images (250 images each have fixations from more than 115 observers) from different image categories. The dataset aggregates data from 11 different published studies and adds 9 studies that have not yet been published. The main goal of this dataset is to combine these diverse studies and to harmonize their metadata to make them easily accessible to a larger audience. Tables 1 and 2 and Fig. 1 give an overview of the studies included in the dataset. The following paragraphs describe the general acquisition procedure that is common throughout the dataset.
Gaze coordinates were acquired with either a head-mounted EyeLink II or a remote EyeLink 1000 eye tracking system (SR Research Ltd., Ottawa, Ontario, Canada), sampled monocularly at 500 Hz. Operators of the gaze tracking system participated in a standardized training course before conducting a study and therefore followed the same recording procedures (a detailed description is included in the dataset and available online at http://cogsci.uni-osnabrueck.de/~nbp/EyeTrackingInstruction.html). Accuracy of the gaze tracking system was checked with calibration and validation sessions before data recording. A general guideline for all recordings was to achieve an average validation error below 0.5° and to keep the maximal error below 1°. Studies that used the head-mounted EyeLink II system additionally carried out repeated drift correction trials to compensate for slip of the eye tracker. The experimenter repeated calibration and validation sessions after breaks and whenever the drift correction error surpassed a predetermined threshold (usually an error of >1°). Participants removed any eye make-up before recording sessions to facilitate accurate gaze tracking. Both systems were able to cope with most types of glasses and contact lenses. All participants had normal or corrected-to-normal visual acuity and were naïve to the purpose of the study. All studies were approved by either the ethics committee of the University of Osnabrück or the ethics committee of the chamber of physicians in Hamburg. All participants gave written informed consent before the start of the study. They were compensated monetarily (usually 5 €/h) or in the form of course credits.
The eye tracking systems were capable of recording gaze location at high temporal frequency. They automatically generated fixation locations and times from the raw gaze location time series, and these were stored in the dataset. All studies used the SR Research default system parameters to define saccades: an acceleration threshold of 8000°/s², a velocity threshold of 30°/s, and a deflection threshold of 0.1°. Fixations were defined as time periods without saccades. The dataset therefore consists of (x,y) gaze location entries for individual fixations. Coordinates are given in pixels with respect to the monitor coordinates (the upper left corner of the screen is (0,0), and down/right is positive). In many cases we also provide raw sample-based data that can be used to validate fixation detection settings. Fixations are labeled with a subject ID, start and end times, image category and image number, the ordinal rank of the fixation within a trial (see Table 3 (available online only)), the trial within an experimental session, and a dataset ID that refers to the source study. Each study might define additional information for a fixation, such as experimental condition and subject-specific information (see Table 3 (available online only)).
www.nature.com/sdata/ SCIENTIFIC DATA | 4:160126 | DOI: 10.1038/sdata.2016.126
During construction of the dataset, we harmonized file and category names across studies to ensure that stimulus and category indices refer to the same stimuli. An important consequence of this harmonization is that the dataset contains stimuli in their original size only. Since stimuli might have been presented on different displays with different resolutions and sizes, the user of the dataset has to either transform the gaze locations to match the original stimulus or rescale the stimuli to the size used during presentation. Table 2 gives stimulus sizes, display resolution (in pixels and degrees), stimulus position on the screen, viewing distance, and pixels per degree.
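The mapping between screen and stimulus coordinates described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code distributed with the dataset: the function names and parameters (`stim_left`, `stim_top`, `scale`) are hypothetical, and the actual values for each study have to be taken from Table 2. The `pixels_per_degree` helper uses plain screen geometry and averages over the full screen width, so it is only an approximation of the conversion factors listed in Table 2.

```python
import math


def pixels_per_degree(screen_width_px, screen_width_cm, viewing_distance_cm):
    """Approximate pixels per degree of visual angle, averaged over the screen width."""
    # Visual angle subtended by the full screen width, in degrees.
    width_deg = 2.0 * math.degrees(
        math.atan(screen_width_cm / (2.0 * viewing_distance_cm))
    )
    return screen_width_px / width_deg


def screen_to_stimulus(x, y, stim_left, stim_top, scale=1.0):
    """Map a screen coordinate (pixels) to original-stimulus pixel coordinates.

    stim_left, stim_top: assumed screen position of the stimulus's upper left corner.
    scale: displayed size divided by original size (1.0 if shown unscaled).
    """
    return (x - stim_left) / scale, (y - stim_top) / scale


def is_on_stimulus(sx, sy, stim_width, stim_height):
    """True if a stimulus coordinate falls inside the original image."""
    return 0 <= sx <= stim_width and 0 <= sy <= stim_height


# Hypothetical example: an image shown at twice its original size,
# with its upper left corner at screen position (240, 0).
sx, sy = screen_to_stimulus(740, 400, stim_left=240, stim_top=0, scale=2.0)
```

Rescaling the stimulus instead of the gaze coordinates is equivalent; transforming the coordinates has the advantage that the image files stay untouched.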
The dataset contains anonymised data, where a numerical ID identifies studies and participants. No personal information is contained.
The following paragraphs provide more information about the individual studies. Examples of stimuli are provided in Fig. 2.
The stimuli were either unmodified fractals or globally and/or locally modified images derived from the same fractals. Global modifications concerned the addition of varying degrees of noise to the phase spectrum of the fractals. Local modifications entailed local increases or decreases in luminance contrast at five locations. The viewing duration was 5 s. After exploration of the stimuli, participants performed a recognition task. The same stimulus was shown together with the unmodified or another luminance-modified version of the stimulus. The observer's task was to identify the one they had just seen.
Age study [ID 0, 58 participants, 10.5 × 10^4 fixations]
This is a patch recognition experiment that compares viewing behavior of three different age groups (school children, students, and elderly) 2. Participants saw 64 images from each of the categories Natural, Urban, and Fractal, and 63 images from the category Pink-noise. Stimulus presentation was balanced such that pairs of observers saw all images within each of these four categories (255 in total). The presentation time was 5 s.
This study 13 investigated the influence of handedness and spectral content on the occurrence of horizontal biases during free-viewing. It corresponds to the data of the second experiment in ref. 13. The experiment included 31 right-handers and 17 left-handers. Participants viewed 120 images for 6 s each. Each image was preceded by a drift correction. Images were presented either in an original or mirror-reversed version, and either with full spectral content or low-pass or high-pass filtered (Gaussian filter, cutoff of 0.6 c/degree). Each subject explored only one version of each image. In half of the trials, the drift correction fixation dot remained visible for 1 s after stimulus onset, and we requested subjects to keep fixating until it disappeared (delay trial). If a subject's gaze moved outside a radius of 1 visual degree from the center, the trial terminated, and a feedback message was delivered. Delay and non-delay trials were blocked across the experiment.

Gap [ID 22, 24 participants, 4.9 × 10^4 fixations]
This study 13 investigated the influence of drift-correction trials on horizontal biases during free-viewing. It corresponds to the data of the third experiment in ref. 13. Twenty-four right-handers participated in the experiment. Participants viewed 120 images for 6 s each. We introduced temporal gaps of 0, 300, 600, and 900 ms between the disappearance of the fixation dot in the middle of the screen for drift-correction and the appearance of the images. During the temporal gap, the screen was at the gray scale level of the drift-correction period, and the gap duration was randomized across trials. Subjects did not receive any instruction in relation to the existence of a gap.
[…] additionally bit into a mouth guard fitted for each participant. Stimuli were randomized, such that pairs of observers saw all images in all conditions, i.e., each observer saw 32 images from a category in the head fixed and 32 in the head free condition. The study also contained a guided viewing task where observers had to follow a point which jumped to a new location once it was fixated. In this experiment, the average validation error did not surpass 0.55°, with the exception of subject 6 (0.74° in condition 1).
Memory I [ID 4, 45 participants, 17.9 × 10^4 fixations]
Participants freely observed 48 images in a randomized order and with five repetitions 15,16. They consecutively saw 5 blocks of all images. The block number is coded as 'iteration'. The images equally covered four categories, namely Natural, Urban, Fractal, and Pink noise images. Presentation duration was 6 s for each image. Before an image appeared, participants had to fixate on a cross presented in the center of the screen. A short 5-minute break after the third presentation block maintained participants' alertness and avoided potential fatigue.
Memory II [ID 5, 34 participants, 10.9 × 10^4 fixations]
The design of this study 15 was similar to that of Memory I, with the exceptions noted here. Participants repeatedly explored 30 urban images for 6 s each. The images differed regarding their complexity and were grouped in 5 consecutive blocks. Ten images depicted global scenes containing many houses, streets, and other objects (high complexity); 10 images depicted local arrangements such as single houses (medium complexity); 10 images depicted close-ups of urban details, such as park benches or staircases (low complexity). Four independent raters judged image complexity and showed perfect inter-rater agreement. A high image resolution (2560 × 1600 px) conserved details for an in-depth exploration. After the experiment ended, participants once more observed all images for 6 s. However, this time they were asked to explore those image regions that they considered uninteresting. We conducted this additional trial for exploratory reasons. The corresponding data have not been included in the published results but are included here (iteration = 6).

Monocular [ID 12, 68 participants, 31.4 × 10^4 fixations]
This unpublished study investigates the occurrence of viewing biases in monocular vision. All participants viewed the images with their right eye; the left eye was occluded with an eye patch. Participants freely observed 240 images for 6 s each. All images were shown at 30″; some images were resized by bicubic interpolation to the corresponding ratio and a resolution of 2560 × 1600.
This recognition experiment presented fractals with local contrast modifications and phase scrambling. The base stimuli were identical to the AFC study. Participants explored stimuli for 5 s. Subsequently, the participants indicated whether a local image patch, taken from the previous or a randomly selected stimulus, originated in the previously explored stimulus or not.
This study 18 investigated visual exploration of natural images under stereoscopic presentation conditions using specialized equipment. 3D images of natural scenes were taken using a pair of digital cameras. These photographed scenes were also laser-scanned to obtain the ground-truth depth structure of the scenes. These depth maps allowed presentation of the depth structure independent of image content and therefore made it possible to study the influence of binocular disparity information on eye-movements. Each image was presented either stereoscopically (3D) or not (2D). Furthermore, a given depth map was presented either with its corresponding luminance information (natural) or following spectral modifications (pink noise or white noise), leading to 6 conditions across 2 factors. Presentation duration was 20 s. Participants were required to press a button as soon as they recognized at least two depth layers in the images.
This study 19 investigated how visual and auditory sources of information were integrated during free-viewing of natural images. In total, 64 natural images were shown, either accompanied by sounds presented from the left or right side of the monitor (audio-visual conditions, AVL or AVR) or without any sounds (visual condition). Sounds were played during the presentation of visual stimuli through speakers flanking the monitor. Presentation time was 6 s. Auditory stimuli consisted of natural sounds (e.g., bird sounds). During the auditory condition, sounds were played while white noise images were presented. Subjects were instructed to study the images and listen to the sounds carefully.
This study was designed to investigate exploration and exploitation on stimuli with varying spatial properties. Participants freely observed 360 images from the categories urban, nature, and webpages for 6 s each. The images were presented in five different sizes (7″, 10″, 15″, 21″, and 30″). The 30″ images served as the full size condition. The remaining sizes were achieved by either scaling down the image coordinates from 30″ to the desired size or by cropping out the central part of the 30″ image according to the desired size. The field 'scaled' indicates whether a stimulus was scaled or cropped. The background color for smaller images was set to neutral gray (RGB color: 128, 128, 128).
Participants saw screenshots of 90 websites in three different task conditions 21. Stimulus presentation was balanced such that triplets of observers saw all stimuli in all conditions. The first task was a free-viewing task in which participants were instructed to 'simply explore the website' for 6 s. The second task, the content awareness task, was similar, but participants had to select a target user group for each site afterwards. The third task presented a search term before stimulus presentation, and participants had to rate how well the website fit the search term.
The dataset contains fields that encode the user group rating, the shown user groups, the relevance of the search term, and a familiarity rating of the website.
Webtask @ School [ID 14, 24 participants, 4.0 × 10^4 fixations]
This study is similar to the webtask study. A subset of 60 webtask stimuli was shown to school children attending 6th grade in a secondary school in a small town in Germany. All other aspects were equal to the webtask study.
APP [ID 6, 73 participants, 9.9 × 10^4 fixations]
This study 14 investigated eye-movements leading up to and following the initial perception of ambiguous and disambiguated line drawings. Data from 73 naïve participants were included. They viewed 11 ambiguous stimulus sets, each including an ambiguous and two disambiguated stimuli, as well as 36 control stimuli. Participants freely explored the images in order to identify what was shown. They pressed a button upon successful recognition. Following the button press, the stimuli remained visible for another 4 s. Afterwards, participants indicated prior knowledge of the stimulus and rated their perceptual certainty.
APPC [ID 7, 46 participants, 1.2 × 10^4 fixations]
Similar to APP above, participants freely explored line drawings with the goal of identifying the content 22. In contrast to APP, this paradigm placed the drawings in context. Each context was congruent with one of the two interpretations of the ambiguous stimulus. Triggered by the first saccade, the context was immediately taken off screen, and the experiment then followed the procedures in APP. Eight ambiguous and disambiguated stimulus sets were included, as well as eight unambiguous control stimuli. Data from 46 participants were included in the dataset.
This study investigated eye-movements during a face discrimination task. Faces were computer-generated to form a circular similarity continuum spanning 360 degrees in steps of 11.25 degrees (32 faces). Participants were randomly associated with a pair of opposing faces (separated by 180 degrees, labeled 0 or 180). In each trial one of the reference faces was shown (duration: 1.5 s) together with a different test face. Participants reported at the end of the trial whether the two faces were the same or different.
Depending on the performance of the participant, an adaptive algorithm decided on the angular distance between the reference and test faces for the next trial (for example: 0 degrees vs 22.5 degrees or 180 degrees vs 168.75 degrees). Two psychometric functions, one for each reference face, were derived that map angular distances to the probability of perceiving a difference. The same discrimination task was repeated following a learning procedure (see Face Learning below), which required participants to associate an aversive outcome with one of the faces. Stimuli spanned 27° to approximate face sizes during everyday interactions.
This study tested the effect of aversive associative learning on the exploration of faces. Eight faces, separated by 45 degrees, were selected for this experiment (see Face Discrimination above). During the conditioning phase, one randomly selected face was paired with an aversive outcome (mild noxious stimulation of one hand in 33% of trials), whereas the most dissimilar face (separated by 180 degrees) was kept neutral. Following this learning phase, all faces were presented and the effect of aversive learning on the exploration of faces was investigated. Before aversive learning (baseline phase), faces were all neutral, and the aversive stimuli were delivered in a predictable manner following a non-face symbol. As in the face discrimination task, stimulus duration was 1.5 s. Subjects were required to press a button as soon as an oddball target (blurred face) was presented. Before and after the aversive learning, some participants performed a perceptual discrimination experiment (see Face Discrimination above).

Code availability
We provide Python and MATLAB code to load the dataset. The Python code was tested with Python 2.7, h5py version 2.5.0, and HDF5 version 1.8.15. The MATLAB code was tested with version 8.3.0.532 (R2014a). This code is distributed with the dataset and subject to the same license.

Data Records
The dataset consists of one HDF5 file ('etdb_1.0.hdf5'), which contains eye tracking data; a folder that contains stimuli ('Stimuli'); and one semicolon-separated text file with UTF-8 encoding ('meta.csv') that contains experimental metadata associated with each individual dataset (Data Citation 1). The file 'etdb_1.0.hdf5' is a standard HDF5 file created with h5py version 2.5.0 and HDF5 version 1.8.15. HDF5 allows the structuring of data into groups, much like a file system organizes data with folders and files. In this case, each study in the dataset is stored in a group whose name corresponds to the study name. Within each group, we store vectors that encode information about fixations. Each index of these vectors encodes a fixation, e.g., accessing etdb_1.0.hdf5 at AFC/x[10] retrieves the horizontal location of the tenth fixation in the AFC study. Table 3 (available online only) shows what information is encoded for every study in 'etdb_1.0.hdf5'. Some experiments require additional information to correctly interpret data from a trial. For example, the webtask study presented search terms, potential user groups, and URLs in some of the trials. This information is represented for each fixation by a linear index into an attribute of a group. For example, if the 'url' field 'etdb_1.0.hdf5/Webtask/url[5]' is 2, then the corresponding URL is encoded in 'etdb_1.0.hdf5/Webtask/attrs/url[2]'. This index is 1-based, i.e., 1 refers to the first element in an attribute list.
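As a rough illustration of the layout described above, the following Python sketch reads all fixation vectors of one study group with h5py and resolves the 1-based attribute indices. It is not the loading code distributed with the dataset. The helper names `load_study` and `resolve_index` are hypothetical, and treating NaN or non-positive indices as "no value" is an assumption about how missing entries are encoded.

```python
def load_study(path, study):
    """Read every vector stored directly in one study group into a dict."""
    import h5py  # imported lazily so resolve_index can be used without h5py

    with h5py.File(path, "r") as f:
        group = f[study]
        # Keep datasets only; nested groups (if any) are skipped.
        return {
            name: item[...]
            for name, item in group.items()
            if isinstance(item, h5py.Dataset)
        }


def resolve_index(indices, attribute_list):
    """Translate 1-based per-fixation indices into attribute values."""
    values = []
    for i in indices:
        if i != i or i < 1:
            # Assumption: NaN or non-positive entries mean "no value".
            values.append(None)
        else:
            values.append(attribute_list[int(i) - 1])
    return values


# Hypothetical attribute list and per-fixation indices:
urls = ["http://a.example", "http://b.example", "http://c.example"]
print(resolve_index([2, 1, 3], urls))
```

With the real file, a call such as `fix = load_study("etdb_1.0.hdf5", "AFC")` would return one array per field. Note that Python indexing is 0-based, so the tenth fixation is `fix["x"][9]`.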
The file 'meta.csv' is a CSV file with a table that contains meta-information about each study. In particular, it contains stimulus sizes, display sizes (in pixels and degrees), and a conversion factor to translate pixels to degrees of visual angle. This allows mapping fixation locations onto stimuli.
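A semicolon-separated table of this kind can be read with Python's standard `csv` module, as in the sketch below. The column names `study` and `pixels_per_degree` and the two example rows are invented for illustration; the actual header of 'meta.csv' should be inspected before use.

```python
import csv
import io


def read_meta(file_obj):
    """Parse a semicolon-separated metadata table into a list of dicts."""
    return list(csv.DictReader(file_obj, delimiter=";"))


def px_to_deg(distance_px, px_per_degree):
    """Convert a distance in pixels to degrees of visual angle."""
    return distance_px / px_per_degree


# Hypothetical excerpt; the real column names in 'meta.csv' may differ.
example = io.StringIO(
    "study;pixels_per_degree\n"
    "StudyA;45.6\n"
    "StudyB;30.0\n"
)
rows = read_meta(example)
ppd = {row["study"]: float(row["pixels_per_degree"]) for row in rows}

# With the real file (UTF-8 encoded, as stated in the text):
# with open("meta.csv", encoding="utf-8", newline="") as f:
#     rows = read_meta(f)
```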
The stimuli are located in 'Stimuli/', which contains subfolders for each stimulus set. Stimulus sets are encoded numerically (6-Websites, 7-Natural, 8 […]).
Unfortunately, we were not able to obtain the rights to publish four of the 64 fractal stimuli in category 10 under a CC0 license. Some of these were obtained from fractal collections on the internet whose authors we were unable to contact. However, we made sure that all fractals are free to use for research purposes. We can provide these stimuli upon request.
We also distribute additional raw data files and metadata wherever available. Metadata is distributed as comma-separated text files that map subject IDs to metadata. Each file contains descriptions of the respective columns. These files can be found in the folder 'additional_metadata/'. Sample-based data is provided, wherever possible, as additional HDF5 files with a structure similar to that of 'etdb_1.0.hdf5'. Instead of fixations, each vector here contains the x,y locations of each sample provided by the eye tracker. Field names are the same as in the fixation-based dataset. Sometimes fields are prefixed by 'left' or 'right' to distinguish which eye was tracked. In this case x,y positions are encoded in fields called 'left_g{x,y}' or 'right_g{x,y}'. Sample-based data files can be found in the folder 'additional_samples/'.
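One way to check fixation detection settings against the sample-based data is a simple velocity-threshold classification, sketched below. This is a deliberately simplified stand-in and not the SR Research parser: it applies only a velocity criterion (30°/s, the default mentioned in the Methods) and ignores the acceleration and deflection criteria. The function names are hypothetical, and the pixels-per-degree value must be taken from the study metadata.

```python
import math


def sample_velocities(x, y, sampling_rate_hz, px_per_degree):
    """Point-to-point gaze speed in degrees per second, one value per sample pair."""
    speeds = []
    for i in range(1, len(x)):
        dist_px = math.hypot(x[i] - x[i - 1], y[i] - y[i - 1])
        speeds.append(dist_px / px_per_degree * sampling_rate_hz)
    return speeds


def fixation_intervals(x, y, sampling_rate_hz, px_per_degree,
                       velocity_threshold=30.0):
    """Return (first, last) sample index pairs of runs below the velocity threshold."""
    if len(x) == 0:
        return []
    speeds = sample_velocities(x, y, sampling_rate_hz, px_per_degree)
    # A sample counts as saccadic if it was reached by a fast movement.
    fast = [False] + [v > velocity_threshold for v in speeds]
    intervals, start = [], None
    for i, is_fast in enumerate(fast):
        if not is_fast and start is None:
            start = i
        elif is_fast and start is not None:
            intervals.append((start, i - 1))
            start = None
    if start is not None:
        intervals.append((start, len(fast) - 1))
    return intervals


# Synthetic gaze trace at 500 Hz with one rapid 100 px shift to the right.
xs = [100, 100, 100, 150, 200, 200, 200]
ys = [300] * 7
print(fixation_intervals(xs, ys, sampling_rate_hz=500, px_per_degree=50))
```

Intervals found this way can be compared with the start and end times of the stored fixations; systematic differences are expected because the criteria differ.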

Technical Validation
One of the most important aspects of the reliability of gaze tracking is its spatial accuracy. The data in this dataset were recorded with two high-precision eye trackers (EyeLink II and EyeLink 1000) that are known for their high accuracy. Furthermore, a calibration and validation session preceded every recording block, and data recording was only started when the average error fell below a pre-specified threshold. The threshold depends on the study (Table 2), but is always smaller than 0.6° of visual angle. Studies that used the head-mounted EyeLink II system frequently checked tracking accuracy by presenting drift correction trials. In these trials, participants fixate on a dot, which allows calculating the measurement error of the tracking system.
A second important aspect of reliability is the temporal accuracy of saccade onsets and offsets. Data in this dataset were sampled at 250 or 500 Hz, i.e., one sample every 4 or 2 ms, which is fast relative to typical fixation durations (200–300 ms). Figure 1b shows a histogram of fixation durations for all contained studies.
A final consideration is the proficiency of users who operate eye tracking equipment. A standardized training system ensured proficiency. It teaches all new users how to operate the equipment and how to deal with common difficulties (e.g., make-up, glasses, etc.). Users at the University Medical Center in Hamburg-Eppendorf all underwent the same training procedure.

Usage Notes
This dataset is distributed in open and standardized file formats (HDF5, text, PNG) and can therefore be processed with many software packages. In particular, we made sure that the data can easily be read with Python, R, and MATLAB.
Users should keep in mind the following caveats. First, the duration of a fixation is given by the difference between its end and start times. Please note that the end and start times themselves are meaningless, since they are expressed relative to some unknown point within the experiment. Second, mapping fixations to stimulus locations requires either mapping gaze locations onto the stimulus or scaling the stimulus appropriately. For example, the stimuli in the 'Head Fixed' study were shown on a screen with a 16:9 aspect ratio while the images were 4:3. This leaves a gray border of 240 px to the left and right of the image, which is not included in the image file in the dataset. Horizontal (x) coordinates smaller than 240 pixels and larger than 1680 pixels are therefore outside the image. Third, in most cases participants had to fixate a fixation dot before stimulus onset, and the first fixation within a trial can be driven by this fixation dot (in some studies fixation onset times <0 are indicative of this). In some cases, the fixation dot remained visible for a while after an image change, or there was a gap between disappearance of the dot and appearance of the image. In these cases, time 0 of the trial corresponds to the onset of the image or of the gap period. Fourth, in some experiments, images were presented in their original and mirrored versions. Since images are provided only in their original versions, these images need to be left-right flipped when mapping gaze coordinates from mirrored trials to images.
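The first, second, and fourth caveats can be made concrete with a short Python sketch. The function names are hypothetical. The image width of 1440 px follows from the 240 px and 1680 px bounds given for the 'Head Fixed' example, and flipping continuous coordinates as `image_width - x` is an assumed convention (for integer pixel indices, `image_width - 1 - x` would be the equivalent).

```python
def fixation_duration(start, end):
    """Duration of a fixation; only the difference of the two times is meaningful."""
    return end - start


def head_fixed_to_image(x, y, border_px=240, image_width=1440):
    """Shift a screen coordinate from the 'Head Fixed' example into image coordinates.

    Returns (image_x, image_y, inside), where inside refers to the horizontal
    extent only; the vertical extent is not given in the text.
    """
    image_x = x - border_px
    inside = 0 <= image_x <= image_width
    return image_x, y, inside


def unmirror_x(image_x, image_width):
    """Map an x coordinate recorded on a mirrored image onto the original image."""
    return image_width - image_x


# A fixation at screen position (1500, 400) on a mirrored 'Head Fixed' trial:
ix, iy, inside = head_fixed_to_image(1500, 400)
original_x = unmirror_x(ix, 1440)
```

Alternatively, the image itself can be flipped left-right and the gaze coordinates left unchanged; both approaches give the same correspondence between gaze and image content.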