Evaluation of an eye tracking setup for studying visual attention in face-to-face conversations

Vehlen, Antonia; Spenthof, Ines; Tönsing, Daniel; Heinrichs, Markus; Domes, Gregor

doi:10.1038/s41598-021-81987-x

Download PDF

Article
Open access
Published: 29 January 2021

Evaluation of an eye tracking setup for studying visual attention in face-to-face conversations

Antonia Vehlen¹^na1,
Ines Spenthof²^na1,
Daniel Tönsing²,
Markus Heinrichs² &
…
Gregor Domes¹

Scientific Reports volume 11, Article number: 2661 (2021) Cite this article

17k Accesses
17 Citations
4 Altmetric
Metrics details

Subjects

Abstract

Many eye tracking studies use facial stimuli presented on a display to investigate attentional processing of social stimuli. To introduce a more realistic approach that allows interaction between two real people, we evaluated a new eye tracking setup in three independent studies in terms of data quality, short-term reliability and feasibility. Study 1 measured the robustness, precision and accuracy for calibration stimuli compared to a classical display-based setup. Study 2 used the identical measures with an independent study sample to compare the data quality for a photograph of a face (2D) and the face of the real person (3D). Study 3 evaluated data quality over the course of a real face-to-face conversation and examined the gaze behavior on the facial features of the conversation partner. Study 1 provides evidence that quality indices for the scene-based setup were comparable to those of a classical display-based setup. Average accuracy was better than 0.4° visual angle. Study 2 demonstrates that eye tracking quality is sufficient for 3D stimuli and robust against short interruptions without re-calibration. Study 3 confirms the long-term stability of tracking accuracy during a face-to-face interaction and demonstrates typical gaze patterns for facial features. Thus, the eye tracking setup presented here seems feasible for studying gaze behavior in dyadic face-to-face interactions. Eye tracking data obtained with this setup achieves an accuracy that is sufficient for investigating behavior such as eye contact in social interactions in a range of populations including clinical conditions, such as autism spectrum and social phobia.

GazeBaseVR, a large-scale, longitudinal, binocular eye-tracking dataset collected in virtual reality

Article Open access 30 March 2023

A dual mobile eye tracking study on natural eye contact during live interactions

Article Open access 14 July 2023

Looking at faces in the wild

Article Open access 16 January 2023

Introduction

Eye tracking, the video-based, non-invasive measurement of an individual’s eye movements, has become a widely used standard technique for measuring visual attention. In recent decades, numerous studies have been published that provide insights into the cause and effect of visual attention in many different contexts, such as reading¹, consumer preferences², emotion processing³, social interaction⁴ and psychopathology (e.g. autism: Ref.⁵). In general, attention has been investigated using well-established behavioral paradigms, such as the spatial cueing tasks⁶, the dot-probe task⁷, or classical search paradigms⁸. In addition, these basic behavioral paradigms have been extended by eye tracking to investigate the role of eye movements for attentional processes, especially in a social context^9,10,11.

The majority of these studies used infrared remote eye trackers to track a participant’s eye movements while observing stimuli displayed on a screen (for review see: Ref.¹²). This technique takes advantage of the fact that stimuli that attract attention are more likely to be looked at¹³. Based on video recordings, translations of corneal reflections under infrared illumination relative to the center of the pupil are used to estimate the orientation of the eyeballs. Following appropriate calibration the gaze point on a given stimulus can be calculated¹⁴.

So far, many eye tracking studies in the social context have used static stimuli (e.g. facial expressions: Refs.^15,16) or dynamic stimuli (e.g. short video clips of social interaction: Refs.^17,18) to investigate attention to social stimuli in clinical and healthy populations. Although these studies have expanded our knowledge regarding attentional factors in the processing of social stimuli, these paradigms lack an important aspect: the possibility to interact with the social stimulus. In line with this notion, studies have reported that the physical presence¹⁹ and the eye contact of the counterpart²⁰ significantly influences the gaze behavior of the participants.

To overcome the limitations of these so-called isolated paradigms^21,22, eye tracking was introduced in real-life interactions with mobile eye trackers that can be worn like ordinary glasses^23,24. Although this approach seems promising, the natural gaze behaviour could be compromised, as seeing the interaction partner wearing eye tracking glasses could lead to a constant feeling of being observed²⁵.

Another approach uses network-based video communication platforms to implement standard display-based eye tracking. In this paradigm, two people communicate with each other via video streaming, while their gaze behavior is tracked on the computer display showing the interaction partner. In recent studies, this approach has been proven feasible for investigating social attention during video-based communication^26,27,28. Although these approaches improve the ecological validity of eye tracking research in social situations, video-based social interactions seem to be affected by the method used for communication²⁹ and the possibility of establishing eye contact is only made possible by complex technical setups³⁰.

To maximize the possibilities for social interaction, at least within dyadic interactions, eye tracking systems in naturalistic setups ideally have the following characteristics³¹: low intrusiveness, flexible setup, robustness against head movement, and high accuracy. In sum, naturalistic eye tracking setups should offer sufficient data quality and at the same time give the participants the feeling of a natural conversation even in a laboratory environment.

Robustness, precision and accuracy are the main basic indices to describe the quality of eye tracking data¹⁴. Tracking robustness (or trackability) as the most basic parameter is calculated as the percentage of available gaze data over a predefined time period. It is determined by dividing the amount of available data by the total number of samples in that period. Data loss may be due to the fact that critical features for measuring eye movements are not detected by the eye tracker (e.g. corneal reflection or the pupil^14,32). Precision can be defined as the average variation of the measured gaze points, where precise data leads to small deviations between individual gaze points, or in other words, compact point clouds of gaze data. Precision can be expressed as the standard deviation of the data sample (SD) or the root mean square (RMS) of inter-sample distances as recommended in the literature¹⁴. Accuracy can be referred to as the trueness of gaze measurement and can be defined as the average mean offset between the measured gaze position and the actual target point on the stimulus display, or in other words, data is accurate when the distance between the centroid of the measured data and the target point is small.

Under standard laboratory conditions (constant distances, luminance, restriction of head movements, etc.), most eye trackers offer excellent robustness of recording, with only minimal proportions of lost data, at least from the majority of participants. Precision can be increased to a certain degree by using low-pass filtering or smoothing of the raw data. Therefore, accuracy can be considered the most crucial and problematic parameter in the majority of studies.

Eye tracker manufacturers usually specify accuracy and precision of their products after testing them under ideal conditions, e.g. display-based setup with the optional usage of a chinrest depending on the product. As an example, for a standard remote infrared eye tracker (e.g. Tobii X3-120), accuracy and precision (SD) has been reported in the product description of the manufacturer with values of 0.4 and 0.47°, respectively³³. A visual angle of 0.4° is equivalent to 4 mm on a computer display at a standard viewing distance of 65 cm. In other words, fixations on two stimuli with a distance of more than 4 mm can be reliably distinguished at a viewing distance of 65 cm.

However, it has been shown that eye trackers typically show lower accuracy and precision when applied under non ideal conditions (e.g. absence of chinrest³⁴). For example the Tobii X2-30 was specified with an accuracy of 0.4° and precision (SD) of 0.26°³⁵, but an independent evaluation estimated an accuracy and precision (SD) of 2.5 and 1.9°, respectively³⁶. This is problematic, since a deviation in accuracy even as small as 0.5° has been shown to significantly change results regarding the gazing durations for multiple areas of interest (AOIs)³⁷. The problem becomes even greater with increasing viewing distance, as is the case in face-to-face conversations. With increasing viewing distance, the spatial resolution on the display plane decreases depending on the level of accuracy. A viewing distance of 150 cm at an accuracy of 0.5° would lead to a spatial resolution of approximately 1.5 cm. Therefore, it is important to evaluate experimental setups in the field of eye tracking research with respect to the research questions, taking into account data accuracy as one of the limiting factors when observing gaze behavior on small areas of interest such as the eye region.

In the following, we introduce an eye tracking setup that has been developed to track the gaze behavior of one participant in a dyadic interaction. The use of a remote eye tracker placed at the rim of the field of view conceals the fact that the gaze is being recorded while maintaining a maximum of freedom in head and body movement. The aim is to achieve high quality eye tracking data without limiting the natural feel of the examined situation. In three independent studies we measured robustness, precision and accuracy of this setup, compared it to a standard display-based setup, tested its feasibility to measure gaze behavior on 3D stimuli, examined the effect of short interruptions and observed data quality and gaze pattern on another person during a face-to-face conversation.

Study 1: Comparison of display-based and scene-based eye tracking

In the first study, we compared an infrared remote eye tracker’s robustness, precision and accuracy in a standard computer display-based setup with the scene-based setup. This scene-based setup can be used to assess gaze behavior for stimuli that are not presented on a computer display but rather in naturalistic environment, such as a conversation partner in social interactions or any other face-to-face scenario. The display-based setup served as a control condition for eye tracking under ideal conditions.

Methods

Participants

Twenty participants were recruited via on-campus announcements at the University of Trier. The exclusion criteria covered visual impairments exceeding ± 0.5 diopters and any type of prescriptive visual aid. Upon arrival at the laboratory, participants completed an online questionnaire to monitor the exclusion criteria and other factors that affect the quality of eye tracking data such as eye infections or allergic reaction leading to watery or irritated eyes. No participants had to be excluded. The final sample comprised of 20 healthy participants (10 male/10 female) aged 26 ± 4 years. The ethics committee at the University of Trier approved the study. The research was conducted in accordance to the Declaration of Helsinki. All participants gave written informed consent and received 10 € after completion of the study. Individuals who are identifiable in the images and video gave informed consent for publication.

Setup and stimuli

In both conditions, data was recorded at a sampling frequency of 120 Hz with a Tobii X3-120 remote infrared eye tracker. Both setups were placed in the same bright room with constant lighting and temperature. Constant lighting was assured by blocking daylight by means of window blinds and using diffuse artificial lighting attached to the ceiling.

Display condition

In the standard display-based setup, the eye tracker was attached to a standard 24 in. monitor (16:9) with a 1920 px × 1080 px resolution connected to a standard PC running a Nvidia Quadro P620 graphics card and Windows 10. The eye tracker was attached via mounting brackets to the bottom of the monitor and tilted by an angle of 25°. Participants were placed at a distance of 65 cm in front of the eye tracker. Eye movement data were recorded with a script written in Python (version 2.7) using the API of Tobii Pro SDK (http://developer.tobiipro.com/python.html). The calibration stimuli were black points on a white background with an initial radius of 1.51 cm (1.5°; see Fig. 2a for an image of the calibration area). The points were horizontally 22 cm (20.5°) and vertically 12.4 cm (11.9°) apart, resulting in a calibration area of 44.0 × 24.8 cm (active display size 53 × 30 cm; 48.4 × 28.5°; see Fig. 2a). Once the stimulus appeared, the participants had 500 ms to fixate on it. After this period, the stimulus shrank continuously in size to a radius of 0.14 cm.

Scene condition

The eye tracker was placed on a table (80 × 80 cm; height: 72 cm) using a fixed monopod (height: approx. 23 cm). The eye tracker was tilted at an angle of 28° and connected to an external processing unit fixated under the table. This unit was connected to a standard PC running Tobii Studio Pro (version 3.4.5) to record eye tracking data and the corresponding video stream. A webcam (Logitech c920) was placed at a height of 140 cm for recording the scene from the participant’s point of view (see Fig. 1) with a 1920 px × 1080 px resolution. Participants sat on a height-adjustable chair in front of the eye tracker at a distance of 65 cm. Before the experiment started, the chair was individually adjusted in height so that each participant sat directly under the camera. The setup with the corresponding distances is shown in Fig. 1. The calibration stimuli were black circles on a white sheet with a radius of 1.45 cm (~ 1.3°). The points were 20 cm (~ 8.7°) vertically and horizontally apart, resulting in a calibration area of 40 × 40 cm (active display size 50 × 50 cm; 21.6 × 21.6° visual angle; see Fig. 2a). The calibration sheet was attached to a light grey vertical partition wall (180 × 80 cm) that could be moved after the calibration procedure was completed.

Design and procedure

The study used a within-subject design to directly compare tracking robustness, precision and accuracy for presentation without a computer display (scene condition) and a standard computer display-based setup (display condition). To control for sequence effects, the order of the conditions was randomized.

Display condition

The condition started with a 9-point-calibration procedure, followed by three 5-point-validation sequences with points at different fixation locations across the calibration area (see Fig. 2a). During the calibration sequence, the points appeared automatically on the screen, and at the end of the sequence the experimenter received numerical feedback on the quality of the calibration. The calibration sequence was repeated up to three times if calibration quality exceeded an accuracy of 1°. Once the calibration was successfully completed, the three validation sequences followed, in which five of the calibrations points were shown in randomized order, resulting in 15 trials per condition per participant. The top center (1), the three points from the middle row (2–4) and the bottom center (5) of the calibration points were chosen as target locations to represent the usual presentation area of stimuli. The points automatically reappeared on the screen, and the program set the markers used for the offline segmentation of the eye tracking data accordingly.

Scene condition

The calibration and validation sequence was adapted for the scene condition. The nine target points of the calibration sequence were printed on a poster and announced by the experimenter. The feedback regarding calibration quality was not automatically quantified by the Tobii software, but provided in form of a visual display which was quantified by an in-house AutoIt script. Again, calibration sequences were repeated up to three times if calibration quality exceeded an accuracy of 1°. The procedure of the following validation sequences differed from the display condition in that the points were announced verbally by the experimenter in 2-s intervals, i.e. the experimenter announced one point and pressed a marker as soon as the participant looked at the corresponding target (visible in a live video provided by the Tobii software Pro Studio) and then waited 2 secs before announcing the next target.

Data analyses

Preprocessing

All calculations were based on averaged binocular data. Segmented raw gaze data was averaged over the three trials per participant. Every single segment comprised of 120 data points (1 s). The raw data of the segments were visually examined for outliers from the centroid by plotting individual gaze data on the corresponding calibration sheet. Of the 600 trials, start markers of 12 trials (2%) had to be corrected to exclude outlier data points. One trial had to be excluded due to experimenter error (0.17%).

Calculation of robustness, precision, and accuracy

Robustness was calculated as the percentage of available data, i.e. by dividing the number of valid data points (as coded in the data output of the Tobii software) by the total number of data points and multiplying by one hundred. Precision (1) as a measure of variance was calculated via the standard deviation (SD) of data samples and is reported in degrees of visual angle and in centimeter accordingly. Again, low values correspond to high precision.

$$\theta_{SD} = \sqrt {\frac{1}{n}\sum\limits_{i = 1}^{n} {\mathop {\left( {x_i - \overline{x}} \right)}\nolimits^{2} } + \frac{1}{n}\sum\limits_{i = 1}^{n} {\mathop {\left( {y_i - \overline{y}} \right)}\nolimits^{2} } }$$

(1)

Accuracy (2) was calculated as the mean offset of the gaze point position in relation to the target point position and is reported in degrees of visual angle and in centimeter. Thus, low values correspond to high accuracy.

$$\theta_{offset} = \sqrt {\mathop {\left( {x_{target} - \overline{x} \, } \right)}\nolimits^{2} + \mathop {\left( {y_{target} - \overline{y}} \right)}\nolimits^{2} } \,$$

(2)

Calculations of basic quality indices were conducted via in-house scripts in Matlab (2019, version 9.7.0).

Statistical analyses

The influence of the within-subject factors condition (display vs. scene) and target location (target locations 1–5 of the validation sequences) were tested for robustness, precision, and accuracy with repeated measures ANOVAs. Follow-up repeated measures ANOVAs with the factor target location and the respective post-hoc comparisons were conducted for each condition separately. Post-hoc single comparisons are reported with Bonferroni correction. In cases of heterogeneity of covariance (Mauchly test of sphericity) Greenhouse–Geisser corrections were applied. All statistical analyses were done in R version 4.0.0³⁸ with the level of significance set to p < 0.05. Effect sizes are reported as ƞ²_G to ensure comparability with other studies³⁹.

Results

Effects of condition and target location on robustness, precision and accuracy

The 2 (condition) × 5 (target location) ANOVA for robustness revealed a significant main effect of condition, F(1,19) = 45.38, p < 0.001, ƞ²_G = 0.32, but neither a significant main effect of target location, F(2.98,56.64) = 0.79, ε = 0.75, p = 0.502 , ƞ²_G = 0.01, nor a significant interaction effect, F(2.99,56.79) = 0.89, ε = 0.75, p = 0.454, ƞ²_G = 0.01. The robustness was better in the scene compared to the display condition (see Table 1).

Table 1 Eye tracking robustness, precision and accuracy under standard laboratory conditions for a standard display setup (display condition) and the scene-based setup (scene condition). Values in degrees of visual angle and cm.

Full size table

The ANOVA for precision (SD) with the within-subject factors condition and target location showed a significant main effect of condition, F(1,19) = 7.79, p = 0.012, ƞ²_G = 0.08, a significant main effect of target location, F(2.54,48.22) = 4.77, ε = 0.63, p = 0.008, ƞ²_G = 0.08, and a significant interaction effect, F(2.62,49.79) = 11.11, ε = 0.66, p < 0.001, ƞ²_G = 0.14. Descriptive data show that precision of data was better in the scene condition compared to the display condition (see Table 1).

The follow-up repeated measures ANOVA revealed a significant main effect of target location in the display condition, F(1.70,32.31) = 10.43, ε = 0.43, p < 0.001, ƞ²_G = 0.29, as well as in the scene condition, F(3.21,61.03) = 3.68, ε = 0.80, p = 0.015, ƞ²_G = 0.09. Post-hoc single comparisons in the display condition showed that the following points differed significantly in precision: 1 vs. 5, t(76) = − 6.18, p < 0.001, r = 0.58, 2 vs. 5, t(76) = − 4.70, p < 0.001, r = 0.48, 3 vs. 5, t(76) = − 3.41, p = 0.011, r = 0.34) and 4 vs. 5, t(76) = − 3.44, p = 0.009, r = 0.37, with reduced precision at point 5. The post-hoc comparisons for the scene condition revealed a significant difference between location 2 vs. 5, t(76) = 3.02, p = 0.034, r = 0.33 and location 4 vs. 5, t(76) = 3.12, p = 0.026, r = 0.34, with better precision at location 5.

The same ANOVA for accuracy revealed no significant main effect of condition, F(1,19) = 3.71, p = 0.069, ƞ²_G = 0.04, no significant effect of target location, F(3.15,59.81) = 2.45, ε = 0.79, p = 0.070, ƞ²_G = 0.04, but a significant interaction of condition and target location, F(2.51,47.63) = 4.15, ε = 0.63, p = 0.015, ƞ²_G = 0.07, indicating that target location differentially influenced tracking accuracy in the display and scene condition. Accordingly, the follow-up repeated measures ANOVAs revealed that in the display condition tracking accuracy differed between the target locations, F(2.88,54.70) = 3.74, ε = 0.72, p = 0.018, ƞ²_G = 0.11. Accuracy values for each target location in the display condition can be found in Fig. 2b. Post-hoc comparison for the display condition revealed significant differences between locations 1 vs. 4, t(76) = − 2.91, p = 0.047, r = 0.24 and locations 3 vs. 4, t(76) = − 3.17, p = 0.022, r = 0.34, with better accuracy at locations 1 and 3. In contrast, in the scene condition tracking accuracy was equal for all target locations, F(3.22,61.21) = 2.00, ε = 0.81, p = 0.119, ƞ²_G = 0.07.

Discussion

In sum, the results of the first study showed that both display-based and scene-based eye tracking is possible with comparable robustness, precision and accuracy. Using a standard display-based setup, we were able to reproduce previously reported values of precision and accuracy for the specific eye tracker used in the present study. In the scene-based setup tracking for a 2D stimulus, average accuracy is approx. 0.34° which corresponds to approx. 8 mm at a viewing distance of 131 cm. Thus, using the specific eye tracker (in this study a Tobii X3-120) in a scene-based setup can provide data that allows distinguishing gaze points that are at least 1.6 cm apart, taking into account the imprecision of the eye tracker. This accuracy seems to be sufficient for most studies in the context of social interactions, as it enables a valid distinction to be made for fixations on common regions of interest, such as the eyes and the mouth or other facial features.

However, accuracy obviously depends on the location of the target. The central location showed the best data quality in the display condition, while data quality declined towards the edges of the calibration area with increasing visual angle. This is in line with previous reports (e.g. Ref.³²) and indicates that stimuli should be ideally placed in the center of the calibration area. However, the present results suggest that in the scene condition tracking quality, especially accuracy, is less dependent on stimulus location, as there were no significant differences in accuracy for the different target locations during validation. This might be mainly due to the lower visual angle covered by the calibration area in the scene condition, compared to the display condition.

Study 2: Data quality for 2D and 3D and reliability of calibration

In the second study we aimed to test eye tracking data quality of the setup for 3D compared to 2D stimuli and the reliability of eye tracking quality by dividing the study into two runs without re-calibration in between. In addition, we used the same validation procedure as in study 1 in order to replicate the findings regarding basic quality indices.