TüEyeQ, a rich IQ test performance data set with eye movement, educational and socio-demographic information

We present the TüEyeQ data set - to the best of our knowledge - the most comprehensive data set generated on a culture fair intelligence test (CFT 20-R), i.e., an IQ Test, consisting of 56 single tasks, taken by 315 individuals aged between 18 and 30 years. In addition to socio-demographic and educational information, the data set also includes the eye movements of the individuals while taking the IQ test. Along with distributional information we also highlight the potential for predictive analysis on the TüEyeQ data set and report the most important covariates for predicting the performance of a participant on a given task along with their influence on the prediction. Measurement(s) intelligence • eye movement • Socioeconomic Factors Technology Type(s) culture fair intelligence test (CFT-R) • eye tracking device • Questionnaire Sample Characteristic - Organism Homo sapiens Measurement(s) intelligence • eye movement • Socioeconomic Factors Technology Type(s) culture fair intelligence test (CFT-R) • eye tracking device • Questionnaire Sample Characteristic - Organism Homo sapiens Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.14173244


Background & Summary
For many decades, research in various fields has been devoted to the question of what constitutes human intelligence 1,2 , the ways in which it develops in the course of our life 3 , and how it can be positively influenced 4 . Additionally, recent developments in the field of artificial intelligence are pushing methodological approaches towards the mimicking of human reasoning and problem solving strategies.
To support the research community and the work at the intersection of psychological and educational sciences and artificial intelligence, we provide the TüEyeQ data set (see Fig. 1). We collected a comprehensive data set from 315 university students performing a culture fair intelligence test (CFT 20-R). The CFT 20-R consists of 4 blocks, each with a fixed time limit and items that successively increase in difficulty. Along with the performance data, we provide socio-demographic and educational background information on the students as well as carefully annotated eye movement data of all participants during task solving. We believe that this data set will boost the research in various fields and will contribute to highly interesting research questions: • For psychology and psychometric research the question of how perception through eye movements and IQ (as measured by the CFT 20-R test) relate to each other can be thoroughly analysed (e.g. 5 ). Also the question which tasks and how many of those tasks are needed to derive the respective conclusions can be explored. • For educational research, the relationship between an individual's performance on a standardized IQ Test and socio-demographic and educational background can be further explored. • From a socio-economic perspective the question of whether the IQ-related performance of a designated and coherent group of people can be boosted by adequately assigning tasks and aggregating answers can be further analyzed (e.g. 6,7 ). • From a data science perspective the data set provides valuable means to analyse performance bias with respect to background information on the participants, such as education, training, viewing behaviour, gender and many more.
www.nature.com/scientificdata www.nature.com/scientificdata/ • From the cognitive science perspective, this data set can make an important contribution to the study of strategy-related indicators in the context of (complex) problem solving (e.g. 8 ). Additionally, TüEyeQ can support research on the relationship between eye movements and (fluid) intelligence (e.g. 9,10 ). • In the research field of human-computer interaction our data set can be used to explore various aspects of cognitive load and adaptation. More specifically, the user performance and behavioral data as described by the eye movements and pupillary measures can be employed to identify indicators of cognitive load and how they can be used to predict task difficulty under time constraints. • From the AI perspective the question of whether machine learning algorithms can learn to reason as humans and whether it is possible to develop an AI system that correctly solves such a test is among current challenges in AI research (e.g. [11][12][13] ).
Methods experimental design description. For TüEyeQ, 315 healthy participants (217, female, 94 male, 4 not stated; with an age mean of 23.272 years, SD 3.022) with a university entrance qualification, without neurological or psychiatric pre-existing conditions, and no visual impairment above 3 dioptres participated in our study. All participants underwent a large-scale study which aims at investigating the interaction between different partial abilities relevant for self-regulation in educational contexts. For this purpose, 321 participants were recruited. They performed different cognitive tasks (measuring, for instance, executive functioning or IQ) and filled in multiple questionnaires regarding self-regulation indicators (e.g., motivational beliefs or personality traits) in a lab setting during three sessions, each lasting up to 4 hours. Unfortunately, we had to exclude 4 participants because they did not complete all study sessions, and two more participants were excluded because of technical issues. For their participation in the study, the participants received remuneration of 8 EUR per hour and additionally 15 EUR in case they participated in all sessions. The participation in the study could be revoked at any time. All collected data were anonymized and treated confidentially. In this work, we report specifically on performance data of this cohort of participants in an IQ test, their eye movement behavior during solving this IQ test and their socio-demographic and educational background characteristics, including information on software usage and leisure time activities. The questionnaire on socio-demographic and educational background of the participants was the first questionnaire of the first session of the study. The participants completed the CFT test immediately afterwards.
The Ethical Review Board of the Psychological Institute at the University of Tübingen approved the protocol of the study. All participants were informed in written form and consented that their anonymous data can be analyzed and published. Due to a self-constructed pseudonym, they had the option to revoke this consent at any time.
The CFT 20-R test. In our experiment, we employed the first part of the revised version of the culture fair intelligence test (CFT-R) designed by Weiß et al. 14 . This IQ-test is intended to measure the general mental capacity (i.e., the g-factor of intelligence or fluid intelligence) by means of tasks that require the ability to recognize figural relationships and to engage in formal logical thinking in problems of varying degrees of complexity under time restriction. Since the tasks administered are language-free, it is assumed that people with poor language knowledge and poor cultural techniques are not disadvantaged. The CFT 20-R consists of four blocks of different tasks, namely series continuation, classifications, matrices and topological conclusions. Each of these blocks has 11-15 items with increasing difficulty and a time limit of 3-4 minutes. Details and exemplary test items are presented in Fig. 2.
In order to record the eye movements of the participants during the task, we adapted the classic pen-and-paper version of the IQ test to a digital one that can be displayed on a computer screen. To be as close to the paper version as possible, we presented as many items as possible on a single screen page as long as this did not necessitate scrolling. Tracker   315 participants  Socio-demographic information  CFT 20-R Performance Data   30  35  40  45  50  55 Ag gre ga te d CFT www.nature.com/scientificdata www.nature.com/scientificdata/ Participants first received general instructions about the nature of the test, followed by the first block. Each block had a specific instruction, introducing the participants to the requirements of this block and demonstrating the essence of the task based on 3 examples. The instruction phase was conducted without time constraints, thus all participants could go through the examples and make themselves familiar with the test procedure. All instructions were presented in German using the SoSci Survey online platform.

SMI Red250 Eye
Data acquisition. Data collection took place in a digital classroom equipped with 30 remote eye trackers attached to laptops with 17inch HD display screens running at full brightness with a resolution of 1920 × 1080. This setup allows for data collection of up to 30 participants simultaneously, minimizing the overall time needed for collection. For this study, verbal instructions were given en masse pertaining to a brief overview of the protocol and an explanation of eye tracking, then individual calibrations were performed with a supervised quality check. Interactions between the participants and the computer took place via mouse or touch pad, depending on participants' preference. The distance between participants and their respective screen was 50-70 cm depending on participants' preferences.
The collection environment had the room illumination level controlled with no effects from sunlight or other outdoor light. The standard maintained illuminance for the experimental sessions was between 10 to 50 lux, measured with a Lux sensor (i.e., Gossen Mavo-Max illuminance sensor, MC Technologies, Hannover, Germany).
Eye movement data was collected by means of SMI RED250 remote eye trackers, a commercial eye tracker with 250 Hz sampling frequency. Since the eye tracker has a high sampling frequency, both stable (fixations) and rapid (saccadic) eye movements for static stimuli can be measured. Eye movements were recorded using the included eye-tracking software Experiment Center which outputs the raw gaze data consisting of x and y coordinates of each data point, the timestamp information and the pupil diameter in millimeters.
Calibration was performed for all participants using SMI's built in 9-point calibration. A validation also was performed as a quality check to measure the gaze deviation for both eyes from a calibration point: A deviation larger than one degree required re-calibration. Calibrations were performed prior to the experiments as well as one or two times during the experimental session, depending on how many images were presented.

Data Preprocessing
Quality of eye tracking data. Initially, the raw gaze data was examined for signal quality using the eye-tracking software BeGaze provided along with the eye trackers. This software reports proportion of valid gaze signal to stimulus time as the tracking ratio. Therefore, if a participant's tracking ratio was deemed insufficient (i.e., less than 80% for at least a part of the task), we omitted his or her data. This pre-processing stage can assure that errors (e.g. post-calibration shifts, poor signal due to glasses) in the gaze data are substantially minimized. Consequently, eye-tracking data of 58 participants had to be omitted due to low tracking ratios. Further 11 data sets were excluded due to errors in the presentation software and another 17 because of incomplete data. This leaves us with eye tracking data for only 229 of 315 participants. The raw eye-tracking data was then pre-processed to improve the data quality and to extract several features.

Fixations.
Fixations are periods where the eye is stationary and a single location is gazed upon usually lasting between 200 ms and 350 ms 15 . Fixations not only contain information regarding what exact location participants look at, but also provide useful characteristics like frequency and duration. Longer fixations are associated with higher processing load and more effort 16,17 . As a further consequence, frequency of fixations increases as their duration increases. Fixation information was extracted from the eye-tracking protocols based on the I-VT algorithm 18 as implemented by the Perception's Engineer Toolkit 19 . As a minimum fixation duration a threshold of 75 ms was chosen.
saccades. Saccades are rapid eye movements that allow to change the focus of attention. As for fixations, saccades were extracted from the eye-tracking protocols based on the I-VT algorithm 18 with a velocity threshold of 30°/s. Since the velocity of saccades is not voluntarily controlled 20 , but depends on the neural activity, our data provides a powerful resource to study gaze parameters in cognitive engagement. www.nature.com/scientificdata www.nature.com/scientificdata/ Microsaccades. Microsaccades are fixational eye movements, which occur during an especially prolonged fixation. They are similar to regular saccades, but smaller, jerk-like, and happen involuntarily. Their properties are linked to visual attention 21,22 , perception 23 , working memory 24 , and task difficulty 25 . To identify microsaccades we followed the procedure proposed by Krejtz et al. 26 . The reported microsaccade-related features are comprised of the occurrence of microsaccades per fixation, their mean peak velocity, and their mean amplitude.
pupil-related features. Pupil diameter has been used as an indicator of cognitive load for a very long time.
Already in 1964, Hess and Polt reported that increasing task difficulty lead to increased pupil diameters 27 . Since then, this finding has been replicated across many different domains including short-term memory, language processing, reasoning, perception, sustained attention and selective attention [28][29][30] . Pupil dilation, however, is not only invoked by cognitive load 31 , but also emotional arousal 32 or lightning conditions, fatigue 33,34 , pain 35 , and distance to a fixated object 36 . A comprehensive overview of factors effecting the pupil is provided by Mathot 37 .
To enable further research in the above mentioned areas based on our IQ-Data, we pre-processed the recorded pupil signal to enhance its quality and reduce the influence of artifacts and noise. For example, blink artifacts were removed by removing eye-tracking data in a time window 100 ms before and after a blink. During this period, the pupil may still be effected by the blink and recorded diameters are unlikely to accurately reflect any of the above mentioned effects (e.g., cognitive load, fatigue, etc). Furthermore, we removed pupil diameter values smaller than 0 (i.e., measurement errors) and replaced them by missing values. Finally, we interpolated small gaps of up to 50 ms (which corresponds to 12 data points at a sampling rate of 250 Hz) and smoothed the data using a third order low-pass Butterworth filter with a 2 Hz cutoff as proposed by 38 .
From the processed pupil signal, we extract the mean which is provided as an additional feature along with fixation and saccade information.

Visual scanpath annotation.
For each test item, we carefully annotated the visual search behaviour as the spatio-temporal sequence of fixations and saccades, i.e. the visual scanpath, during task solving. Figure 3 shows an exemplary test item as included within Block 1 of the CFT-R test. The participant has to find the best matching shape (i.e., target) among four distractors, which provides the best continuation of the given shape series. Semantic annotations in so-called Areas of Interest (AoIs) is visualized with a black box around specific parts of the stimulus. The figure further features an exemplary scanpath of a participant, where fixation locations are visualized by ellipses and saccades as arrows or vectors. Figures 4 and 5 show the visual scanpaths of two participants solving the same task. The participant, whose visual scanpath is shown in Fig. 4, finds the solution (target) within very few fixations. Since the further options are not even visually explored, we can argue that the task was solved at high certainty. In contrast, the visual scanpath of the second participant, depicted in Fig. 5, reveals a visual exploration strategy which reflects a high uncertainty and anticipates a wrong solution to the given task. This example highlights the potential for further in-depth analysis of visual scanpaths and other eye movement measures related to (fluid) intelligence.
IQ task performance data. Performance in the CFT was dichotomized for each item, where one corresponds to a correct solution and zero to an incorrect solution. We decided to indicate items where no answer   www.nature.com/scientificdata www.nature.com/scientificdata/ was selected as missing values because these missing values convey additional information (e.g., items that were skipped are identifiable trough this coding). Due to technical problems, six items in the CFT had issues in a distractor or target. Specifically, the eleventh item from the matrices and the third item from the topological conclusions block displayed an erroneous target during the experiment (i.e., the correct answer was not included in the possible options). The four other items (from the six with technical issues) included an inaccurate distractor. To reconcile these issues and maintain the validity of the overall task performance multiple steps were taken. First, all affected items were marked in the data set (suffix 'e'). Second, the two items with erroneous targets were treated as missing, as participants were not able to solve them correctly. Third, sum scores for the items with inaccurate distractors, the regular items, and both, were compared through correlations. The analyses showed that performance on the items with inaccurate distractor and performance on the correctly displayed items (r(315) = 0.412, p < 0.001) as well as the overall sum scores with and without items with inaccurate distractors (r(315) = 0.992, p < 0.001) correlated significantly. This indicates that the items with an inaccurate distractor still measure performance that corresponds to overall task performance and are eligible for further analyses.
socio-demographic Information. In addition to the above mentioned performance and eye movement features, we report socio-demographic and educational background information for each participant. The background information also includes leisure and gaming activities, software and internet usage, programming experience, and many more. A detailed description of these features along with their representation (i.e., encoding) in the data set is provided in Table 1.
Specifically, we have categorized the leisure activities according to 39 , who distinguish between 13 recreational activities. Since most of our participants take part in several leisure activities, we mapped the activities to a one-hot encoding scheme, which corresponds to all features with the prefix "leisure_[…]" in the data set.
The International Standard Classification of Education (ISCED) provides a framework of definitions that enable a comparison of education systems. We categorized our "study_subjects" according to the Fields of Education and Training 40 of ISCED.
Similarly, the International Standard Classification of Occupations (ISCO-08) 41 defines different groups of occupations. However, we found that ISCO-08 does not adequately represent the range of occupations that was provided by our German participants. Specifically, ISCO-08 would have introduced a significant imbalance with respect to the "Professionals" category. Instead, we applied the taxonomy of stepstone.de (as of Nov. '20), which is one of the most popular online job markets in Germany. This taxonomy led to a much more fine-grained and up-to-date categorization in "job_father" and "job_mother".

Data Records
The TüEyeQ data set is available through the Harvard Dataverse Repository 42 .
The performance data along with the socio-demographic and education background information is provided as a single csv-file (cft-full) with 17,010 rows (observations) and 79 columns (features). For each of these features, Table 1 provides both the corresponding description and the encoding.
Eye movement information is provided in the folder (EyeMovementData) and comes in three formats. The first is a participant-and stimulus-wise raw format in folder (raw) that has only been pre-processed as described in the Section Data Preprocessing and consequently contains features for each fixation and saccade. Each subfolder here contains the raw eye-movement data for the participant with the participant ID being the same as the subfolder name. The therein contained files are named according to the corresponding stimulus (e.g., 1-3.csv means that this screen page contains the tasks 1 to 3 from the CFT test). Furthermore, a schematic version of each page of the CFT is provided in the folder (stimuli). For copyright reasons, this folder contains only the AOIs, the specific content of which can be retrieved from the CFT 20-R test 14 .
Mouse clicks of all participants are available in (clicks) -a csv file offering data about each click that was performed by a participant, describing thus, when and where a mouse click occurred, as well as on which part of what IQ task. This information is important for data analysis, since it details when a participant chose an answer and when he or she changed it.
In addition to the raw gaze data, we provide a convenience format stored in the folder (split), where the eye movement data is split up task-wise. This format features folders with participants' IDs that hold task-wise data (e.g., task_07.csv for the seventh task of the CFT 20-R). For this convenience format, participants with substantial offsets in their fixations or noise that could indicate extensive head movements were removed after a visual inspection of the raw eye movement data. This accounted for the removal of 43 participants reducing the set to 186. Most of these 43 participants may be included after manual correction of the fixations at the reader's discretion. The procedure for the assignment of events to tasks is described in short as follows: • Areas deemed relevant were the areas of each sub-task as well as the time remaining and the overall progress bar.
• Any other area was considered task-irrelevant • Fixations on task-irrelevant AoIs were removed • Saccades that neither started nor ended on a relevant AoI were removed • Event outliers were removed by considering the 5 events before and after this event. If less than 30% if these events shared the same AoI, the fixation was considered an outlier. • All events on the AoIs of a task that happened after the participant had performed their final click for that task were removed. • Events involving either the progress bar or the timer were assigned to the task that had the most recent valid gaze. Continued www.nature.com/scientificdata www.nature.com/scientificdata/ For each participant, the eye movement information is provided as cvs files which contain the eye movement and pupil features. All coordinates have their origin in the top left corner of the screen and the x and y axis extend to the right and bottom respectively. The csv files contain the following features: Eye movements and mouse clicks share the same timestamp format, which makes them easy to synchronize for analysis. Furthermore, eye movements, AoIs, and mouse clicks share a common coordinate system allowing for convenient use of all three data sources. www.nature.com/scientificdata www.nature.com/scientificdata/ Finally, we provide aggregated eye-tracking data for each participant and task that is already included in the ready-to-use format. This data is contained in the file named eye_features_aggregated.csv and consists of mean values for each participant and task. For each participant and task, this file provides mean values of fixation duration, saccade count, saccade amplitude, saccade duration, microssacade count, microsaccade amplitude, microsaccade duration, microsaccade peak velocity, and pupil diameter.

technical Validation
Bias considerations. As Fig. 6 shows, there is neither an age-related nor a gender-related bias with respect to the overall performance over all tasks. More specifically, there is no significant difference in performance between male and female participants across the different age groups (right plot in Fig. 6). The four participants whose gender was unknown were omitted in this analysis.
Distribution of the aggregated CFT score. Also, as shown in Fig. 7, the overall performance of the participants -encoded by the variable cft_sum_full -is, as expected, approximately normally distributed. As indicated by the distribution, there are 7 participants (with participant IDs: ' AAB14' , ' ACB13' , ' ATT30' , 'BUO15' , 'OAK22' , 'SKA02' , 'VWK01') -represented by the two bars on the very left-hand side of the distribution plot -who show an unexpectedly low overall performance. We hypothesize that these participants did not take the experiment seriously or simply went through the tasks without thinking carefully about the possible solutions. Many of the CFT tasks were left unsolved by these participants. performance distribution over the CFT tasks in the order of their appearance. Figure 8 shows a histogram over all solved tasks in the order of their appearance in the CFT 20-R. Moreover, for every task, we show the incremental expected number of participants who can correctly solve a given task from the ones seen so far. We see that while the incremental mean decreases, the number of tasks that were skipped by the participants increases, especially within CFT blocks. This is in accordance with the intuition that CFT tasks are ordered (especially within the CFT blocks) by increasing degree of difficulty.
Discriminative information for predicting a participant's performance on a given CFT task. In order to show that our data set does indeed contain discriminative information with respect to a participant's  www.nature.com/scientificdata www.nature.com/scientificdata/ performance on a given CFT task, we performed a binary classification on cft_task using the socio-demographic features. To this end, we ignored all observations with missing values (1,248 entries), i.e. all tasks, where no answer was provided. In general, however, these missing values carry semantic meaning and may be worth further investigating. Besides, we removed the feature participant because it represents an identifier and as such it inadequately increases the dimensionality of the feature vectors. The variable task_id, however, despite representing an identifier, reflects the order in which the CFT tasks occur and thus encodes the difficulty of tasks. Since CFT tasks are ordered by increasing order of difficulty within each CFT block and across blocks, we expect task_id to be a highly discriminative feature with respect to the performance of a participant on a given CFT task. We also removed cft_sum_full, which is an aggregation of the target and could thus cause information leakage.
We applied the following pre-processing steps: • We factorized the string features (i.e. we mapped them to integers).
• We imputed the NaN-values in all categorical features with a new category.
• We imputed the NaN-values in all continuous features with their median.
• Additionally, we normalized all continuous features to the interval [0,1], by using the MinMaxScaler of scikitlearn. The normalization was required for the Logistic Regression experiment, since we applied an 2 regularization.
We randomly split the data into a training (80%) and test set (20%) and trained a Logistic Regression model on TüEyeQ. We specified an 2 regularization and a maximum of 1,000 iterations to train the Logistic Regression model.
To illustrate the importance of input features for the prediction (see Figs. 9 and 10), we computed Shapley values according to 43 and Lime values according to TabularLime 44 . Specifically, we approximated Shapley values with LinearSHAP 43 for the Logistic Regression model. Furthermore, we also report the coefficients of the Logistic Regression model. Figure 10 and 9 (right) show that there tends to be an agreement between Shapely values, the . The continuous line is the incremental expected number of participants who can solve any given task from the ones seen so far correctly. For example, at the final task, top_11, we expect 240 out of 315 participants to solve a given task correctly. More specifically, after observing all tasks, each task has been solved correctly by 76.2% of the participants on average. The four task tasks with negligible technical issues during the data collection are highlighted by the four diagonally crosshatched bars. Fig. 9 The Receiver Operating Characteristic curve (left) and the highest coefficients (right) of a Logistic Regression model for predicting a participant's performance on a given task. More specifically, the model is aimed at binary classification of "correct" and "incorrect" answers (where cft_task was used as the target variable).
www.nature.com/scientificdata www.nature.com/scientificdata/ Lime attributions and the Logistic Regression coefficients with respect to the variable influences on the scores produced by the Logistic Regression model. More specifically, variables like task_id (which implicitly encodes the task difficulty) and grades_math along with other grades and variables related to gaming, online and leisure activities are assigned by all approaches high absolute scores, indicating a high importance of these aspects for the prediction model. Especially the importance of the gaming-related variables in the model is in alignment with recent research investigating the relationship between gaming and fluid intelligence (e.g. 45 ), thus providing through TüEyeQ new means for further analysis of such relationships in a thorough way including eye movement behavior.
Variables with highest correlation to the aggregated CFT score. Interestingly, as shown in Table 2, the two variables with the strongest Distance Correlation 46 to the aggregated CFT score, cft_sum_full, are grades_ math and mean_grade_degree. This is in alignment with the importance attribution results discussed above, where these two variables were assigned a high importance for the Logistic Regression model. However, in addition, we also see that other variables like grades_chemistry or grades_physics, which were not regarded as important by the previous attribution schemes, are among the variables with the highest correlation to cft_sum_full. These results together with those discussed in the previous paragraph show that the analysis of the variables' impact on the performance in an IQ test can be quite intricate and needs to involve different measures and cover various aspects.
Correlation between variables. The correlation scores in Table 3 correspond to the pairwise Distance Correlation coefficient for a random sample of 20% of the observations. Note that the Distance Correlation  Fig. 9. We used LinearSHAP and LimeTabular for the Logistic Regression.   www.nature.com/scientificdata www.nature.com/scientificdata/ measures both linear and non-linear relationships between two random vectors. As depicted in Fig. 11, the pairwise correlations are centered around 0.1. The peaked distribution indicates that most of the variable pairs are uncorrelated or very weakly correlated. As shown in Table 3, only 54 out of 3,081 variable pairs have a Distance Correlation above 0.6 (and only 62 pairs have a correlation above 0.5). For the Logistic Regression model described above, we did not remove highly correlated variables, which might entail collinearity issues. Hence, we believe that better predictability could be achieved if more effort was invested in the data analysis and preprocessing steps.

Usage Notes
The data can be downloaded from https://doi.org/10.7910/DVN/JGOCKI as a csv file. All experiments are provided as a Jupyter notebook and may therefore be easily reproduced.
When using the TüEyeQ dataset or parts of it, please cite this manuscript.
Besides the data provided in this work, the authors can provide access to further variables and performance data of the participants on other tasks upon request. Such data comprises for example information on the "Big Five" personality factors, on internal and external locus of control, self-efficacy, attribution styles, academic self-concepts, domain-specific interests, motivational strategies, process and questionnaire data on learning success, learning processes and learning prerequisites as well as performance data for test items measuring executive control functions such as, flexibility, updating, inhibition (n-back, operation and reading Span items, Stroop item performace, Stop Signal Test, Trail Making Test). In this case, a sub-selection of the dataset, whose size will be defined based on the specific requests from interested users and on the processing time needed will be provided. For any questions, suggestions or request of collaboration regarding TüEyeQ please contact the corresponding author. This data set is freely available under the CC0 license.

Code availability
All Python-code corresponding to the evaluations described in this work is distributed on GitHub under the MIT license https://github.com/haugjo/TueEyeQ. To run the evaluation, the following packages are required (note that older or more recent versions might also work): • python (v3. 7 www.nature.com/scientificdata www.nature.com/scientificdata/